# Baseline Experiment

This experiment was run on September 8, 2022 to test the baseline performance of gRPC streaming and the blast benchmarking utility.

This is an initial test to determine the "baseline" performance of gRPC streaming: without any data processing, how many messages (and bytes) per second can we push through a bidirectional stream? The answer gives us an upper bound on how fast a single publisher can publish messages.

**Note:** this is not necessarily the performance of an Ensign server, which could have multiple publish streams open at once; it is only the maximum throughput of a single publish stream. Determining server throughput will require a separate experiment with concurrent publishers.

This experiment was run on my local Mac with a _single_ publisher and a single Ensign server connected via the local loopback interface (127.0.0.1). In the future we will conduct more formal experiments by running this benchmark on pods in a k8s cluster.

**Method**

The Ensign server implements the `Publish` stream with two goroutines: one to `Recv` events and the other to `Send` acknowledgements back. The `Recv` goroutine loops until the stream is closed, receiving messages one at a time and putting them on a channel with a 10,000-message buffer. The `Send` goroutine reads from that channel until it is closed and sends acks back to the client. No other data processing occurs, and all of the work is done in memory.
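The `Recv`/`Send` split is a producer/consumer pipeline joined by a bounded buffer. The server itself is written in Go; the following is only a Python sketch of the same shape, using `asyncio` tasks in place of goroutines and an `asyncio.Queue` in place of the buffered channel. The client stream is simulated by a hypothetical async generator (`fake_stream`), and a `None` sentinel stands in for closing the channel.

```python
import asyncio

BUFFER_SIZE = 10000  # mirrors the 10000-message channel buffer described above


async def fake_stream(n, size=8):
    """Hypothetical stand-in for the client side of the publish stream."""
    for i in range(n):
        yield {"id": i, "data": bytes(size)}


async def recv_loop(stream, queue):
    """Receive events one at a time until the stream ends, queueing each one."""
    async for event in stream:
        await queue.put(event)  # waits only if the buffer is full
    await queue.put(None)  # sentinel plays the role of closing the channel


async def send_loop(queue, acks):
    """Drain the queue until the sentinel arrives, recording one ack per event."""
    while True:
        event = await queue.get()
        if event is None:
            break
        acks.append(event["id"])  # the Go server sends an ack on the stream here


async def publish(stream):
    queue = asyncio.Queue(maxsize=BUFFER_SIZE)
    acks = []
    # Run both loops concurrently, like the two goroutines.
    await asyncio.gather(recv_loop(stream, queue), send_loop(queue, acks))
    return acks


acks = asyncio.run(publish(fake_stream(5)))
```

Because the queue is FIFO and there is a single sender, acks leave in the order events arrived, which is the ordering the blast benchmark relies on.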

The benchmark used is the "blast" benchmark. A single benchmark run creates and allocates in memory 10,000 event messages with random data (8192 bytes each by default; one experiment uses variable-sized events). The publish stream is then opened and the timer started. Two goroutines are launched: one sends all 10k events up the publish stream and the other receives the acks. Because messages are assumed to be acked in order, the latency of an event is defined as the time from its send to its ack. Throughput is the total number of messages (or bytes) divided by the total time taken to send 10k messages and receive 10k acks.
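The latency and throughput definitions above reduce to a few lines of arithmetic. A minimal Python sketch, assuming per-event send and ack timestamps in seconds; the `summarize` function and `BlastStats` fields are hypothetical names for illustration, not part of enbench:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BlastStats:
    mean_latency: float  # seconds
    duration: float      # seconds
    throughput: float    # events per second
    bandwidth: float     # bytes per second


def summarize(send_times, ack_times, start, end, event_size):
    """Compute blast-style metrics from per-event timestamps (in seconds).

    Acks are assumed to arrive in the same order the events were sent, so the
    i-th ack is paired with the i-th send.
    """
    if len(send_times) != len(ack_times):
        raise ValueError("expected exactly one ack per sent event")
    latencies = [ack - sent for sent, ack in zip(send_times, ack_times)]
    duration = end - start  # the timer spans all sends and all acks
    n = len(send_times)
    return BlastStats(
        mean_latency=mean(latencies),
        duration=duration,
        throughput=n / duration,
        bandwidth=n * event_size / duration,
    )


stats = summarize(
    send_times=[0.00, 0.10, 0.20],
    ack_times=[0.05, 0.20, 0.25],
    start=0.0,
    end=0.25,
    event_size=8192,
)
```

In a real harness the timestamps would come from a monotonic clock such as `time.perf_counter()`.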

**Platform**:

- macOS Monterey (v12.5.1)
- Apple M1 Max 64 GB memory (10 cores)
- Networking: localhost (127.0.0.1)
- Go 1.19 build
- gRPC version 1.49.0

**System**:

- enbench version 0.1 ([9217954](https://github.com/rotationalio/ensign-benchmarks/commit/9217954c385e1590b1b72ee6c9c68901dd219402))
- ensign version 0.1 ([6c2ce7a](https://github.com/rotationalio/ensign/commit/6c2ce7a8998cf014d4af29e4d3577ad77e009cb7)). Note: the binary was built before this commit was made, which is why the commit hash recorded in the data is wrong.

## Data Loading and Parsing

```python
%matplotlib notebook
```

```
!ls
```

```
fixed_size.jsonlines    run.sh
results.ipynb           variable_size.jsonlines
```


```python
import json

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

```python
def parse_timedelta(ts):
    """Parse a single-unit Go duration string (e.g. "613.521µs") into a pd.Timedelta."""
    # Check the two-character suffixes before the bare "s" suffix.
    if ts.endswith('ns'):
        return pd.Timedelta(nanoseconds=float(ts.replace('ns', '')))

    if ts.endswith('µs'):
        return pd.Timedelta(microseconds=float(ts.replace('µs', '')))

    if ts.endswith('ms'):
        return pd.Timedelta(milliseconds=float(ts.replace('ms', '')))

    if ts.endswith('s'):
        return pd.Timedelta(seconds=float(ts.replace('s', '')))

    # Fail loudly rather than silently returning None.
    raise ValueError(f"unrecognized duration: {ts!r}")


def load_data(path):
    """Yield one parsed JSON record per line of a JSON Lines file."""
    with open(path, 'r') as f:
        for line in f:
            yield json.loads(line)


def load_dataframe(path):
    """Flatten each blast result into one DataFrame row."""
    data = []
    for row in load_data(path):
        data.append({
            'bandwidth': row['bandwidth'],  # bytes per second
            'events': row['events'],
            'data_size': row['experiment']['data_size'],  # bytes per event
            'failures': row['failures'],
            'throughput': row['latencies']['throughput'],  # events per second
            'duration': parse_timedelta(row['latencies']['duration']),
            'latency': parse_timedelta(row['latencies']['mean']),
            'latency_stddev': parse_timedelta(row['latencies']['stddev']),
        })

    return pd.DataFrame(data)
```
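`parse_timedelta` handles single-unit strings such as `613.521µs` or `1.81794s`, which covers the values in these experiments. Go's `time.Duration.String()` switches to compound forms such as `1m30.5s` once a duration reaches a minute, and those would make `float()` fail. A hedged, standard-library-only sketch of a more general parser (`parse_go_duration` is a hypothetical name) that returns integer nanoseconds; it accepts the forms `Duration.String()` emits and is not a full reimplementation of `time.ParseDuration`:

```python
import re
from decimal import Decimal

# Nanoseconds per unit. Go prints "µs" (U+00B5); "μs" (U+03BC) and "us" are
# also accepted by Go's time.ParseDuration, so they are accepted here too.
_NANOS_PER_UNIT = {
    "ns": 1,
    "us": 1_000,
    "\u00b5s": 1_000,
    "\u03bcs": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60_000_000_000,
    "h": 3_600_000_000_000,
}

# Two-letter units come first so "ms" is not read as "m" followed by "s".
_UNITS = "ns|us|\u00b5s|\u03bcs|ms|s|m|h"
_TERM = re.compile(rf"(\d+(?:\.\d+)?)({_UNITS})")
_FULL = re.compile(rf"[-+]?(?:\d+(?:\.\d+)?(?:{_UNITS}))+")


def parse_go_duration(text):
    """Parse a Go duration string such as "1m30.5s" into integer nanoseconds."""
    if not _FULL.fullmatch(text):
        raise ValueError(f"not a Go duration: {text!r}")
    sign = -1 if text.startswith("-") else 1
    # Decimal keeps the fractional digits exact before converting to int.
    total = sum(Decimal(num) * _NANOS_PER_UNIT[unit] for num, unit in _TERM.findall(text))
    return sign * int(total)
```

`pd.Timedelta(parse_go_duration(ts), unit='ns')` would then produce the same kind of value that `parse_timedelta` returns.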

## Fixed Size Experiment

This experiment focuses on throughput in messages per second. Each run of the blast benchmark sends 10k messages and measures the latency and throughput of that run. Blast was run 257 times, so the results below are means of means: each row holds the mean latency and the throughput for one run of 10k messages.
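Because each row is one run, averaging the `throughput` column gives the mean of per-run rates, which is not the same as the pooled rate (all events divided by all elapsed time). The pooled rate equals events per run divided by the mean duration, and since 1/x is convex for positive x, Jensen's inequality guarantees that the mean of per-run rates is never lower. In the fixed-size results below, 10,000 events divided by the mean duration of 0.062451097 s gives about 160,125 events/s, slightly under the 160,840 events/s mean throughput. A small sketch of both calculations (the function names are hypothetical):

```python
from statistics import mean


def mean_of_run_throughputs(events_per_run, durations):
    """Average of per-run rates: what the mean of the throughput column reports."""
    return mean(events_per_run / d for d in durations)


def pooled_throughput(events_per_run, durations):
    """All events divided by all elapsed time (= events_per_run / mean duration)."""
    return events_per_run * len(durations) / sum(durations)


durations = [0.05, 0.10]  # seconds; two illustrative runs of 10k events each
per_run = mean_of_run_throughputs(10_000, durations)  # mean of 200k and 100k
pooled = pooled_throughput(10_000, durations)         # 20k events / 0.15 s
```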

```python
df = load_dataframe('fixed_size.jsonlines')
df.describe()
```

| statistic | bandwidth (bytes/s) | events | data_size (bytes) | failures | throughput (events/s) | duration (s) | latency (ms) | latency stddev (ms) |
|---|---|---|---|---|---|---|---|---|
| count | 257 | 257 | 257 | 257 | 257 | 257 | 257 | 257 |
| mean | 1317602000.0 | 10000 | 8192 | 0 | 160840.049439 | 0.062451097 | 0.613521 | 0.188804 |
| std | 86610680.0 | 0 | 0 | 0 | 10572.592204 | 0.004258903 | 0.110773 | 0.054554 |
| min | 989742000.0 | 10000 | 8192 | 0 | 120818.120872 | 0.054552500 | 0.365434 | 0.132579 |
| 25% | 1259203000.0 | 10000 | 8192 | 0 | 153711.360807 | 0.059376875 | 0.538725 | 0.161711 |
| 50% | 1320215000.0 | 10000 | 8192 | 0 | 161159.05593 | 0.062050500 | 0.601019 | 0.177498 |
| 75% | 1379662000.0 | 10000 | 8192 | 0 | 168415.73424 | 0.065057 | 0.676525 | 0.203307 |
| max | 1501673000.0 | 10000 | 8192 | 0 | 183309.655836 | 0.082769041 | 1.365572 | 0.696153 |


```python
f = sns.displot(data=df, x='throughput', kde=True, rug=True)
_ = f.ax.set_xlabel("Throughput (events/second)")
_ = f.ax.set_title("Single Publisher Eventing Throughput")
plt.tight_layout()
```


```python
# bandwidth is in bytes/second, so dividing by 1e6 gives megabytes/second.
df['mbs'] = df['bandwidth'] / 1e6
f = sns.displot(data=df, x='mbs', kde=True, rug=True)
_ = f.ax.set_xlabel("Throughput (MB/second)")
_ = f.ax.set_title("Single Publisher Data Throughput (8 KiB Events)")
plt.tight_layout()
```

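The `bandwidth` column is in bytes per second: the fixed-size mean throughput of 160,840.05 events/s multiplied by 8192 bytes gives about 1.3176e9, matching the mean bandwidth. Dividing by 1e6 therefore yields megabytes (MB) per second, not megabits (Mb). A small sketch of the common conversions (`bandwidth_units` is a hypothetical helper):

```python
def bandwidth_units(bytes_per_second):
    """Express a bytes/second rate in decimal, binary, and bit-based units."""
    return {
        "MB/s": bytes_per_second / 1e6,      # megabytes (10**6 bytes)
        "MiB/s": bytes_per_second / 2**20,   # mebibytes (2**20 bytes)
        "Mb/s": bytes_per_second * 8 / 1e6,  # megabits (10**6 bits)
    }


# Mean bandwidth from the fixed-size experiment, in bytes/second.
fixed_mean = bandwidth_units(1_317_602_000)

# Sanity check: events/second times bytes/event should reproduce bytes/second.
implied = 160_840.049439 * 8192
assert abs(implied - 1_317_602_000) / 1_317_602_000 < 1e-5
```

For the fixed-size mean this works out to roughly 1,318 MB/s, 1,257 MiB/s, or 10,541 Mb/s.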

```python
# Convert each mean latency (a Timedelta) to milliseconds for plotting.
df['latency_ms'] = df['latency'].dt.total_seconds() * 1e3
f = sns.displot(data=df, x='latency_ms', kde=True, rug=True)
_ = f.ax.set_xlabel("Mean Latency (ms)")
_ = f.ax.set_title("Single Publisher Event Latency")
plt.tight_layout()
```


## Variable Size Experiment

This experiment investigates the effect of event size on throughput. The size of the event data was increased from 1024 bytes (1 KiB) to 1024*1024 bytes (1 MiB) over 34 evenly spaced sizes (a step of 31,744 bytes, or 31 KiB), and the blast benchmark was run 65 times at each data size, for 2210 runs in total.
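The run count and size statistics in the table below can be cross-checked against the sweep. A minimal sketch, assuming evenly spaced sizes with a step of 31 KiB (31,744 bytes); this step is inferred from the reported count (2210 = 34 sizes × 65 runs), minimum, maximum, and quartiles, not taken from the benchmark configuration:

```python
MIN_SIZE = 1024         # 1 KiB
MAX_SIZE = 1024 * 1024  # 1 MiB
STEP = 31 * 1024        # 31,744 bytes: inferred from describe(), not from the benchmark config
RUNS_PER_SIZE = 65

# Every data size in the sweep, inclusive of the 1 MiB endpoint.
sizes = list(range(MIN_SIZE, MAX_SIZE + 1, STEP))
total_runs = len(sizes) * RUNS_PER_SIZE
```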

```python
df = load_dataframe('variable_size.jsonlines')
df.describe()
```

| statistic | bandwidth (bytes/s) | events | data_size (bytes) | failures | throughput (events/s) | duration (s) | latency (ms) | latency stddev (ms) |
|---|---|---|---|---|---|---|---|---|
| count | 2210 | 2210 | 2210 | 2210 | 2210 | 2210 | 2210 | 2210 |
| mean | 2785822000.0 | 10000 | 524800 | 0 | 22386.849162 | 1.817940062 | 1.157097 | 0.283712 |
| std | 453698500.0 | 0 | 311501.6 | 0 | 71134.411293 | 1.065707234 | 0.492937 | 0.082660 |
| min | 394998200.0 | 10000 | 1024 | 0 | 2653.787864 | 0.021617917 | 0.397717 | 0.136657 |
| 25% | 2798072000.0 | 10000 | 254976 | 0 | 3571.307896 | 0.864209249 | 0.760372 | 0.241693 |
| 50% | 2874390000.0 | 10000 | 524800 | 0 | 5479.720056 | 1.824910875 | 1.107273 | 0.277334 |
| 75% | 2951934000.0 | 10000 | 794624 | 0 | 11571.27455 | 2.800094614 | 1.393746 | 0.311539 |
| max | 3465271000.0 | 10000 | 1048576 | 0 | 462579.257752 | 3.768198708 | 2.732599 | 1.809739 |


```python
df['data_size_kib'] = df['data_size'] / 1024
f = sns.lmplot(data=df, x='data_size_kib', y='throughput', logx=True, height=4, aspect=2.25)
_ = f.ax.set_xlabel("Event Data Size (KiB)")
_ = f.ax.set_ylabel("Throughput (events/second)")
_ = f.ax.set_title("Single Publisher Eventing Throughput")
plt.tight_layout()
```

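With `logx=True`, seaborn's regression plots estimate a linear model of the form y ~ log(x) while drawing the points and the fitted line in the original x space. A dependency-free sketch of that least-squares fit, using the natural log; `fit_logx` is a hypothetical helper, not seaborn's internal function:

```python
import math


def fit_logx(xs, ys):
    """Ordinary least squares for y = a + b * ln(x); every x must be positive."""
    lx = [math.log(x) for x in xs]
    n = len(xs)
    mean_lx = sum(lx) / n
    mean_y = sum(ys) / n
    sxx = sum((v - mean_lx) ** 2 for v in lx)
    sxy = sum((v - mean_lx) * (y - mean_y) for v, y in zip(lx, ys))
    b = sxy / sxx  # slope in log space
    a = mean_y - b * mean_lx  # intercept
    return a, b


# Synthetic data that lies exactly on y = 5 - 2 ln(x).
xs = [1, 2, 4, 8, 16]
ys = [5 - 2 * math.log(x) for x in xs]
a, b = fit_logx(xs, ys)
```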

```python
# bandwidth is in bytes/second, so dividing by 1e6 gives megabytes/second.
df['mbs'] = df['bandwidth'] / 1e6
f = sns.lmplot(data=df, x='data_size_kib', y='mbs', logx=True, height=4, aspect=2.25)
_ = f.ax.set_xlabel("Event Data Size (KiB)")
_ = f.ax.set_ylabel("Throughput (MB/second)")
_ = f.ax.set_title("Single Publisher Data Throughput")
plt.tight_layout()
```


```python
# Convert each mean latency (a Timedelta) to milliseconds for plotting.
df['latency_ms'] = df['latency'].dt.total_seconds() * 1e3
# Bivariate histogram of event size against mean latency.
f = sns.displot(data=df, x='data_size_kib', y='latency_ms')
_ = f.ax.set_xlabel("Event Data Size (KiB)")
_ = f.ax.set_ylabel("Mean Latency (ms)")
_ = f.ax.set_title("Single Publisher Event Latency")
plt.tight_layout()
```
