Merge 4aab719 into 4ee3493
xnuter committed Jan 23, 2021
2 parents 4ee3493 + 4aab719 commit c1472d8
Showing 35 changed files with 983 additions and 116 deletions.
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "perf-gauge"
version = "0.1.1"
version = "0.1.2"
authors = ["Eugene Retunsky"]
license = "MIT OR Apache-2.0"
edition = "2018"
2 changes: 1 addition & 1 deletion README.md
@@ -18,7 +18,7 @@ It works in the following modes:
1. Increase the request rate linearly, e.g. by `1,000` every minute to see how your service scales with load.
1. It can report metrics to `Prometheus` via a `pushgateway`.

-For instance: ![](./examples/prom/http-tunnel-rust-latency.png).
+For instance: ![](./examples/prom/baseline-nginx-stable-p50-99.png).

Emitted metrics are:
* `request_count` - counter for all requests
68 changes: 36 additions & 32 deletions examples/README.md
@@ -16,38 +16,38 @@ TL;DR; you can jump right to [Benchmarks](#benchmarks) and look into methodology

### Types of load

-There are three types of load to compare different aspects TCP proxies:
+There are three types of load to compare different aspects of TCP proxies:

* `moderate load` - `25k RPS` (requests per second). Connections are being re-used for `50` requests.
-  * In this mode we benchmark handling traffic over persisted connections.
-  * Moderate request rate is chosen to benchmark proxies under _normal_ conditions.
+  * In this mode, we benchmark handling traffic over persisted connections.
+  * Moderate request rate is chosen to benchmark proxies under _normal_ conditions.
* `max load` - sending as many requests as the server can handle.
* The intent is to test the proxies under stress conditions.
* Also, we find the max throughput of the service (the saturation point).
* `no-keepalive` - using each connection for a single request
* So we can compare the performance characteristics of establishing new connections.
* Establishing a connection is an expensive operation.
* It involves resource allocation and dispatching tasks between worker threads.
    * As well as clean-up operations once a connection is closed.

### Compared metrics

To compare different solutions, we use the following set of metrics:

* Latency (in microseconds, or `µs`)
  * `p50` (median) - a value that is greater than 50% of observed latency samples (see the nearest-rank sketch after this list)
-  * `p90` - 90th percentile, or a value that is better than 9 out 10 latency samples. Usually a good proxy for a perceivable latency by humans.
+  * `p90` - 90th percentile, or a value that is better than 9 out of 10 latency samples. Usually a good proxy for latency as perceived by humans.
* `p99` - 99th percentile, the threshold for the worst 1% of samples.
-  * tail-latencies: `p99.9` and `p99.99` - may be important for systems with multiple network hops or large fan-outs (e.g. a request gathers data from tens or hundreds microservices)
-  * `max` - the worst-case.
+  * tail-latencies: `p99.9` and `p99.99` - may be important for systems with multiple network hops or large fan-outs (e.g., a request gathers data from tens or hundreds of microservices)
+  * `max` - the worst-case.
* `tm99.9` - trimmed mean, or the mean value of all samples without the best and worst 0.1%. It is more useful than the traditional mean, as it removes a potentially disproportionate influence of outliers: https://en.wikipedia.org/wiki/Truncated_mean
-  * `stddev` - the standard deviation of the latency. The lower the better: https://en.wikipedia.org/wiki/Standard_deviation
-* Throughput `rps` (requests per second)
+  * `stddev` - the standard deviation of the latency. The lower, the better: https://en.wikipedia.org/wiki/Standard_deviation
+* Throughput `rps` (requests per second)
* CPU utilization
* Memory utilization
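
To make the percentile definitions above concrete, here is a nearest-rank sketch (an illustration only; real latency tools often interpolate between samples):

```
sorted samples: s[1] <= s[2] <= ... <= s[1000]

p50   = s[500]    # half of the samples are at or below this value
p90   = s[900]    # 9 out of 10 samples are at or below this value
p99   = s[990]    # threshold for the worst 1%
p99.9 = s[999]
max   = s[1000]   # the worst case
```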

-In other words, we primarily focus on the latency, but also keep an eye on the cost of that latency in terms of CPU/Memory.
-For the `max load` we also assess the maximum possible throughput of the system.
+We primarily focus on the latency and keep an eye on the cost of that latency in terms of CPU/Memory.
+For the `max load`, we also assess the maximum possible throughput of the system.

#### Trimmed mean vs median

@@ -57,18 +57,20 @@ Why do we need to observe trimmed mean if we already have median (i.e. `p50`)?
* `1,2,3,4,5,6,7,8,9,10` - `p50` is `5`, `trimmed mean` is `5.5`
* `5,5,5,5,5,6,7,8,9,10` - `p50` is still `5`, however the `trimmed mean` is `6.25`.

-The same is applicable to any other percentile. If the team only uses `p90` or `p99` to monitor the performance of their system, they may miss dramatic regressions without being aware of that.
+The same applies to any other percentile. If the team only uses `p90` or `p99` to monitor their system's performance, they may miss dramatic regressions without being aware of it.

Of course, we may use multiple `fences` (`p10`, `p25`, etc.) - but why, if we can use a single metric?
In contrast, the traditional mean is susceptible to noise and outliers and not as good for capturing the general tendency.
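
A quick worked calculation for the second series above (in this 10-sample example, trimming drops one sample - i.e., 10% - from each end):

```
samples:       5,5,5,5,5,6,7,8,9,10
median (p50) = 5
mean         = (5+5+5+5+5+6+7+8+9+10) / 10 = 65 / 10 = 6.5
trimmed mean = (5+5+5+5+6+7+8+9) / 8       = 50 / 8  = 6.25   # lowest (5) and highest (10) dropped
```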

### Compared configurations

These benchmarks compare TCP proxies written in different languages, which use Non-blocking I/O.
-Why TCP proxies? This is the simplest application dealing with the network I/O. All it does, is connection establishment and forwarding traffic.
+Why TCP proxies? This is the simplest application dealing with network I/O. All it does is establish connections and forward traffic.
Why Non-blocking I/O? You can read [this post](https://medium.com/swlh/distributed-systems-and-asynchronous-i-o-ef0f27655ce5), which tries to demonstrate why
Non-blocking I/O is a much better option for network applications.

-Let's say, you're building a network service. TCP proxy benchmarks are the lower boundary for the request latency it may have.
-Everything else is added on top of that (e.g. parsing, validating, packing, traversing, construction of data, etc.).
+Let's say you're building a network service. TCP proxy benchmarks are the lower boundary for the request latency it may have.
+Everything else is added on top of that (e.g., parsing, validating, packing, traversing, construction of data, etc.).

So the following solutions are being compared:

@@ -80,18 +82,20 @@
* `NetCrusher` - a Java solution (Java NIO): https://github.com/NetCrusherOrg/NetCrusher-java/
* `pproxy` - a Python solution based on `asyncio` (running in TCP Proxy mode): https://pypi.org/project/pproxy/

Thanks to [Cesar Mello](https://github.com/cmello/) who coded the TCP proxy in C++ to make this benchmark possible.

## Testbed

-Benchmarking network services is tricky, especially if we need to measure difference down to microseconds granularity.
-To rule out network delays/noise we can try to employ one of the options:
+Benchmarking network services is tricky, especially if we need to measure differences down to microsecond granularity.
+To rule out network delays/noise, we can try one of the following options:

-* use co-located servers, e.g. VMs on the same physical machine, or in the same rack.
-* use a single VM, but assign CPU cores to different components to avoid overlap
+* use co-located servers, e.g., VMs on the same physical machine or in the same rack.
+* use a single VM, but assign CPU cores to different components to avoid overlap

-Both are not ideal, but the latter seem to be an easier way. We just need to make sure, that the instance type is CPU optimized,
-and it won't suffer from noisy-neighbor issues. In other words, it must have exclusive access to all cores as we're going to drive CPU utilization close to 100%.
+Neither is ideal, but the latter seems easier. We need to make sure that the instance type is CPU-optimized
+and won't suffer from noisy-neighbor issues. In other words, it must have exclusive access to all cores, as we're going to drive CPU utilization close to 100%.

-E.g. if we use an 8-core machine, we can use the following assignment scheme:
+E.g., if we use an 8-core machine, we can use the following assignment scheme:

* Cores 0-1: Nginx (serves `10kb` of payload per request)
* Cores 2-3: TCP proxy
@@ -103,11 +107,11 @@ This can be achieved by using [cpu sets](https://codywu2010.wordpress.com/2015/0
```
apt-get install cgroup-tools
```

-Then we can create non-overlapping cpu-sets and run different components without competing for CPU and ruling out any network noise.
+Then we can create non-overlapping CPU-sets and run different components without competing for CPU, while ruling out any network noise.
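
For example, a sketch using `cgroup-tools` (this assumes the cgroup v1 `cpuset` controller; the group names and the proxy binary are placeholders):

```
# Create a cpuset per component and pin it to its cores
sudo cgcreate -g cpuset:/nginx
sudo cgset -r cpuset.cpus=0-1 nginx
sudo cgset -r cpuset.mems=0 nginx

sudo cgcreate -g cpuset:/proxy
sudo cgset -r cpuset.cpus=2-3 proxy
sudo cgset -r cpuset.mems=0 proxy

# Launch a component inside its cpuset
sudo cgexec -g cpuset:proxy ./http-tunnel
```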

### Prometheus

-`perf-gauge` can emit metrics to `Prometheus.` To launch a stack, you can use https://github.com/xnuter/prom-stack
+`perf-gauge` can emit metrics to `Prometheus`. To launch a stack, you can use https://github.com/xnuter/prom-stack.
I just forked `prom-stack` and removed everything but `prometheus`, `push-gateway`, and `grafana`. You can clone the stack and launch `make`.
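
For instance:

```
git clone https://github.com/xnuter/prom-stack.git
cd prom-stack
make
```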

Then set the variable with the host, for instance:
@@ -118,26 +118,26 @@ export PROMETHEUS_HOST=10.138.0.2
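
`perf-gauge` can then push metrics to the gateway. A sketch, assuming the `--prometheus` option described in `perf-gauge`'s README and the default `pushgateway` port (verify against `perf-gauge --help`; the remaining arguments are elided):

```
perf-gauge --prometheus $PROMETHEUS_HOST:9091 ...
```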

### Configurations

-Please note, that for all configuration we disable logging to minimize the number of variables, and the level of noise.
+Please note that we disable logging for all configurations to minimize the number of variables and the level of noise.

* [Perf-gauge](./perf-gauge-setup.md)
* [Nginx](nginx-config.md)
* TCP Proxies
* [HAProxy - C](haproxy-config.md)
* [draft-http-tunnel - C++](cpp-config.md)
* [http-tunnel - Rust](rust-config.md)
-  * [tcp-roxy - Golang](golang-config.md)
+  * [tcp-proxy - Golang](golang-config.md)
* [NetCrusher - Java](java-config.md)
* [pproxy - Python](python-config.md)

## Benchmarks

-Okay, we finally got to benchmark results. All benchmark results are split to two batches:
+Okay, we finally got to the benchmark results. They are split into two batches:

* Baseline, C, C++, Rust - comparing high-performance solutions
* Rust, Golang, Java, Python - comparing memory-safe languages

Yep, Rust belongs to both worlds.

* [Moderate RPS](./moderate-tps.md)
* [Max RPS](./max-tps.md)
2 changes: 1 addition & 1 deletion examples/golang-config.md
@@ -1,4 +1,4 @@
-### Running tpc-proxy (Golang)
+### Running TCP-proxy (Golang)

Repository: https://github.com/ickerwx/tcpproxy/

2 changes: 1 addition & 1 deletion examples/haproxy-config.md
@@ -1,6 +1,6 @@
### Setting up HAProxy

-We need to specify TCP frontend and backend. It's important to turn off logging, otherwise it would flood the disk.
+We need to specify a TCP frontend and backend. It's important to turn off logging. Otherwise, it would flood the disk.
Also, it should only use cores #2 and #3:

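The repository's actual config is elided in this truncated view; a minimal sketch of such an HAProxy config might look like the following (addresses, ports, and timeouts are placeholders; `cpu-map` pins the two worker threads to cores 2-3):

```
global
    maxconn 40000
    nbthread 2
    cpu-map auto:1/1-2 2-3
    # no `log` directive - logging stays off

defaults
    mode tcp
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend tcp_in
    bind 0.0.0.0:8999
    default_backend nginx_upstream

backend nginx_upstream
    server nginx1 127.0.0.1:80
```
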
66 changes: 32 additions & 34 deletions examples/max-tps.md
@@ -1,32 +1,30 @@
- [Max TPS](#max-tps)
-  * [High-performance (C, C++, Rust)](#high-performance--c--c----rust-)
+  * [High-performance (C, C++, Rust)](#high-performance-c-c-rust)
+ [Maximum rate achieved](#maximum-rate-achieved)
-    + [Regular percentiles (p50,90,99)](#regular-percentiles--p50-90-99-)
-    + [Tail latency (p99.9 and p99.99)](#tail-latency--p999-and-p9999-)
+    + [Regular percentiles (p50,90,99)](#regular-percentiles-p509099)
+    + [Tail latency (p99.9 and p99.99)](#tail-latency-p999-and-p9999)
+ [Trimmed mean and standard deviation](#trimmed-mean-and-standard-deviation)
+ [CPU consumption](#cpu-consumption)
+ [Summary](#summary)
-  * [Memory-safe languages (Rust, Golang, Java, Python)](#memory-safe-languages--rust--golang--java--python-)
+  * [Memory-safe languages (Rust, Golang, Java, Python)](#memory-safe-languages-rust-golang-java-python)
+ [Maximum rate achieved](#maximum-rate-achieved-1)
-    + [Regular percentiles (p50,90,99)](#regular-percentiles--p50-90-99--1)
-    + [Tail latency (p99.9 and p99.99)](#tail-latency--p999-and-p9999--1)
+    + [Regular percentiles (p50,90,99)](#regular-percentiles-p509099-1)
+    + [Tail latency (p99.9 and p99.99)](#tail-latency-p999-and-p9999-1)
+ [Trimmed mean and standard deviation](#trimmed-mean-and-standard-deviation-1)
+ [CPU consumption](#cpu-consumption-1)
+ [Summary](#summary-1)
* [Total summary](#total-summary)
* [Conclusion](#conclusion)

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>

### Max TPS

-The load is generated without any rate limiting with the concurrency setting `100`.
+The load is generated without any rate-limiting, with the concurrency setting `100`.

#### High-performance (C, C++, Rust)

##### Maximum rate achieved

-The most interesting question, is how much RPS each solution can handle?
+The most interesting question is: how much RPS can each solution handle?

While Nginx is capable of handling `~60k` requests per second (impressive for just two cores!),
all three C/C++/Rust are somewhat comparable (but C++ handled slightly more requests):
@@ -35,34 +33,34 @@
* C++ - 48.8k
* Rust - 46k

-![](./prom/max-baseline-c-cpp-rust-rps.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-rps.png)

##### Regular percentiles (p50,90,99)

The results are somewhat mixed again. While C++ showed a better `p50`, its `p99` is worse.
-At the `p90` level all three are close:
+At the `p90` level, all three are close:

-![](./prom/max-baseline-c-cpp-rust-p50-99.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-p50-99.png)

##### Tail latency (p99.9 and p99.99)

For the tail latency, Rust is better than both C and C++:

-![](./prom/max-baseline-c-cpp-rust-tail.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-tail.png)

##### Trimmed mean and standard deviation

-All three are nearly identical, however C++ is a tiny bit better (see the table below for the numbers):
+All three are nearly identical. However, C++ is a tiny bit better (see the table below for the numbers):

-![](./prom/max-baseline-c-cpp-rust-mean.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-mean.png)

##### CPU consumption

-CPU utilization is important here. What we want, is to saturate the CPU as much as we can.
+CPU utilization is important here. What we want is to saturate the CPU as much as we can.

-Baseline CPU Utilization is 73%, but in fact it is 93% of available cores (as cores 2 and 3 were not used).
+Baseline CPU Utilization is 73%, but in fact, it is 93% of available cores (as cores 2 and 3 were not used).

-![](./prom/max-baseline-c-cpp-rust-cpu.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-cpu.png)

| | CPU Utilization |
|---|---|
@@ -71,8 +69,8 @@
|C++ |96%|
|Rust |93%|

-Which means that C++ managed to use more CPU and spent more time handling requests.
-However, it's worth mentioning that the `draft-http-tunnel` is implemented using callbacks, while the Rust solution is based on `tokio`,
+This means that C++ managed to use more CPU and spent more time handling requests.
+However, it's worth mentioning that the `draft-http-tunnel` is implemented using callbacks, while the Rust solution is based on `tokio`,
which is a feature-rich framework and is much more flexible and extendable.

##### Summary
@@ -88,45 +86,45 @@

##### Maximum rate achieved

-Among memory safe both Rust and Golang showed comparable throughput, while Java and Python were significantly behind:
+Among the memory-safe languages, Rust and Golang showed comparable throughput, while Java and Python were significantly behind:

* Rust - 46k
* Golang - 42.6k
* Java - 25.9k
* Python - 18.3k

-![](./prom/max-rust-golang-java-python-rps.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-rps.png)

##### Regular percentiles (p50,90,99)

-Again, we can see, that at `p50`-`p90` level Golang is somewhat comparable to Rust,
+Again, we can see that at the `p50`-`p90` level, Golang is somewhat comparable to Rust,
but quickly deviates at `p99` level, adding almost two milliseconds.

Java and Python exhibit substantially higher latencies, but Java's `p99` latency is much worse than Python's:

-![](./prom/max-rust-golang-java-python-p50-99.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-p50-99.png)

##### Tail latency (p99.9 and p99.99)

-Tail latency shows even larger difference with Rust, and for Java is the worst of all four:
+Tail latency shows an even larger gap from Rust, and Java's is the worst of all four:

-![](./prom/max-rust-golang-java-python-tail.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-tail.png)

##### Trimmed mean and standard deviation

Golang's trimmed mean is comparable to Rust's, which is impressive.
Again, Java and Python are well behind both Rust and Golang:

-![](./prom/max-rust-golang-java-python-mean.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-mean.png)

##### CPU consumption

-CPU utilization is important here. What we want, is to saturate the CPU as much as we can.
+CPU utilization is important here. What we want is to saturate the CPU as much as we can.

-As we can see, Rust does the best job of utilizing resources while Golang, Java and Python (in this order)
+As we can see, Rust does the best job of utilizing resources, while Golang, Java, and Python (in this order)
allow more power to stay idle.

-![](./prom/max-rust-golang-java-python-cpu.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-cpu.png)

| | CPU Utilization |
|---|---|
@@ -159,13 +157,13 @@
#### Conclusion

The Rust solution is on par with C/C++ solutions at all levels.
-Golang is slightly worse, especially for tail latencies, but is close to high performance languages.
+Golang is slightly worse, especially for tail latencies, but is close to the high-performance languages.

NetCrusher and pproxy have much worse throughput and latency characteristics if a network service is under heavy load.
Moreover, NetCrusher (Java) showed the worst max latency, measured in seconds:

-![](./prom/max-rust-golang-java-python-max.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-max.png)

BTW, try to spot Java on the memory consumption graph:

-![](./prom/java-vs-others-memory.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/java-vs-others-memory.png)
