Merge 4aab719 into 4ee3493
xnuter committed Jan 23, 2021
2 parents 4ee3493 + 4aab719 commit c1472d8
Showing 35 changed files with 983 additions and 116 deletions.
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "perf-gauge"
version = "0.1.1"
version = "0.1.2"
authors = ["Eugene Retunsky"]
license = "MIT OR Apache-2.0"
edition = "2018"
2 changes: 1 addition & 1 deletion README.md
@@ -18,7 +18,7 @@ It works in the following modes:
1. Increase the request rate linearly, e.g. by `1,000` every minute to see how your service scales with load.
1. It can report metrics to `Prometheus` via a `pushgateway`.

-For instance: ![](./examples/prom/http-tunnel-rust-latency.png).
+For instance: ![](./examples/prom/baseline-nginx-stable-p50-99.png).

Emitted metrics are:
* `request_count` - counter for all requests
68 changes: 36 additions & 32 deletions examples/README.md
@@ -16,38 +16,38 @@ TL;DR; you can jump right to [Benchmarks](#benchmarks) and look into methodology

### Types of load

-There are three types of load to compare different aspects TCP proxies:
+There are three types of load to compare different aspects of TCP proxies:

* `moderate load` - `25k RPS` (requests per second). Connections are being re-used for `50` requests.
-  * In this mode we benchmark handling traffic over persisted connections.
-  * Moderate request rate is chosen to benchmark proxies under _normal_ conditions.
+  * In this mode, we benchmark handling traffic over persisted connections.
+  * Moderate request rate is chosen to benchmark proxies under _normal_ conditions.
* `max load` - sending as many requests as the server can handle.
* The intent is to test the proxies under stress conditions.
* Also, we find the max throughput of the service (the saturation point).
* `no-keepalive` - using each connection for a single request
* So we can compare the performance characteristics of establishing new connections.
* Establishing a connection is an expensive operation.
* It involves resource allocation and dispatching tasks between worker threads.
    * As well as clean-up operations once a connection is closed.

### Compared metrics

To compare different solutions, we use the following set of metrics:

* Latency (in microseconds, or `µs`)
  * `p50` (median) - a value that is greater than 50% of observed latency samples (see the nearest-rank sketch after this list)
-  * `p90` - 90th percentile, or a value that is better than 9 out 10 latency samples. Usually a good proxy for a perceivable latency by humans.
+  * `p90` - 90th percentile, or a value that is better than 9 out of 10 latency samples. Usually a good proxy for latency as perceived by humans.
* `p99` - 99th percentile, the threshold for the worst 1% of samples.
-  * tail-latencies: `p99.9` and `p99.99` - may be important for systems with multiple network hops or large fan-outs (e.g. a request gathers data from tens or hundreds microservices)
-  * `max` - the worst-case.
+  * tail-latencies: `p99.9` and `p99.99` - may be important for systems with multiple network hops or large fan-outs (e.g., a request gathers data from tens or hundreds of microservices)
+  * `max` - the worst-case.
* `tm99.9` - trimmed mean, or the mean value of all samples without the best and worst 0.1%. It is more useful than the traditional mean, as it removes a potentially disproportionate influence of outliers: https://en.wikipedia.org/wiki/Truncated_mean
-  * `stddev` - the standard deviation of the latency. The lower the better: https://en.wikipedia.org/wiki/Standard_deviation
-* Throughput `rps` (requests per second)
+  * `stddev` - the standard deviation of the latency. The lower, the better: https://en.wikipedia.org/wiki/Standard_deviation
+* Throughput `rps` (requests per second)
* CPU utilization
* Memory utilization
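
To make the percentile definitions above concrete, here is a nearest-rank sketch (an illustration only; real latency tools often interpolate between samples):

```
sorted samples: s[1] <= s[2] <= ... <= s[1000]

p50   = s[500]    # half of the samples are at or below this value
p90   = s[900]    # 9 out of 10 samples are at or below this value
p99   = s[990]    # threshold for the worst 1%
p99.9 = s[999]
max   = s[1000]   # the worst case
```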

-In other words, we primarily focus on the latency, but also keep an eye on the cost of that latency in terms of CPU/Memory.
-For the `max load` we also assess the maximum possible throughput of the system.
+We primarily focus on the latency and keep an eye on the cost of that latency in terms of CPU/Memory.
+For the `max load`, we also assess the maximum possible throughput of the system.

#### Trimmed mean vs median

@@ -57,18 +57,20 @@ Why do we need to observe trimmed mean if we already have median (i.e. `p50`)?
* `1,2,3,4,5,6,7,8,9,10` - `p50` is `5`, `trimmed mean` is `5.5`
* `5,5,5,5,5,6,7,8,9,10` - `p50` is still `5`, however the `trimmed mean` is `6.25`.

-The same is applicable to any other percentile. If the team only uses `p90` or `p99` to monitor the performance of their system, they may miss dramatic regressions without being aware of that.
+The same applies to any other percentile. If the team only uses `p90` or `p99` to monitor their system's performance, they may miss dramatic regressions without being aware of it.

Of course, we may use multiple `fences` (`p10`, `p25`, etc.) - but why, if we can use a single metric?
In contrast, the traditional mean is susceptible to noise and outliers and not as good for capturing the general tendency.
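
A quick worked calculation for the second series above (in this 10-sample example, trimming drops one sample - i.e., 10% - from each end):

```
samples:       5,5,5,5,5,6,7,8,9,10
median (p50) = 5
mean         = (5+5+5+5+5+6+7+8+9+10) / 10 = 65 / 10 = 6.5
trimmed mean = (5+5+5+5+6+7+8+9) / 8       = 50 / 8  = 6.25   # lowest (5) and highest (10) dropped
```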

### Compared configurations

These benchmarks compare TCP proxies written in different languages, which use Non-blocking I/O.
-Why TCP proxies? This is the simplest application dealing with the network I/O. All it does, is connection establishment and forwarding traffic.
+Why TCP proxies? This is the simplest application dealing with network I/O. All it does is establish connections and forward traffic.
Why Non-blocking I/O? You can read [this post](https://medium.com/swlh/distributed-systems-and-asynchronous-i-o-ef0f27655ce5), which tries to demonstrate why
Non-blocking I/O is a much better option for network applications.

-Let's say, you're building a network service. TCP proxy benchmarks are the lower boundary for the request latency it may have.
-Everything else is added on top of that (e.g. parsing, validating, packing, traversing, construction of data, etc.).
+Let's say you're building a network service. TCP proxy benchmarks are the lower boundary for the request latency it may have.
+Everything else is added on top of that (e.g., parsing, validating, packing, traversing, construction of data, etc.).

So the following solutions are being compared:

@@ -80,18 +82,20 @@
* `NetCrusher` - a Java solution (Java NIO): https://github.com/NetCrusherOrg/NetCrusher-java/
* `pproxy` - a Python solution based on `asyncio` (running in TCP Proxy mode): https://pypi.org/project/pproxy/

Thanks to [Cesar Mello](https://github.com/cmello/) who coded the TCP proxy in C++ to make this benchmark possible.

## Testbed

-Benchmarking network services is tricky, especially if we need to measure difference down to microseconds granularity.
-To rule out network delays/noise we can try to employ one of the options:
+Benchmarking network services is tricky, especially if we need to measure differences down to microsecond granularity.
+To rule out network delays/noise, we can try one of the following options:

-* use co-located servers, e.g. VMs on the same physical machine, or in the same rack.
-* use a single VM, but assign CPU cores to different components to avoid overlap
+* use co-located servers, e.g., VMs on the same physical machine or in the same rack.
+* use a single VM, but assign CPU cores to different components to avoid overlap

-Both are not ideal, but the latter seem to be an easier way. We just need to make sure, that the instance type is CPU optimized,
-and it won't suffer from noisy-neighbor issues. In other words, it must have exclusive access to all cores as we're going to drive CPU utilization close to 100%.
+Neither is ideal, but the latter seems easier. We need to make sure that the instance type is CPU-optimized
+and won't suffer from noisy-neighbor issues. In other words, it must have exclusive access to all cores, as we're going to drive CPU utilization close to 100%.

-E.g. if we use an 8-core machine, we can use the following assignment scheme:
+E.g., if we use an 8-core machine, we can use the following assignment scheme:

* Cores 0-1: Nginx (serves `10kb` of payload per request)
* Cores 2-3: TCP proxy
@@ -103,11 +107,11 @@ This can be achieved by using [cpu sets](https://codywu2010.wordpress.com/2015/0
```
apt-get install cgroup-tools
```

-Then we can create non-overlapping cpu-sets and run different components without competing for CPU and ruling out any network noise.
+Then we can create non-overlapping CPU-sets and run different components without competing for CPU, while ruling out any network noise.
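
For example, a sketch using `cgroup-tools` (this assumes the cgroup v1 `cpuset` controller; the group names and the proxy binary are placeholders):

```
# Create a cpuset per component and pin it to its cores
sudo cgcreate -g cpuset:/nginx
sudo cgset -r cpuset.cpus=0-1 nginx
sudo cgset -r cpuset.mems=0 nginx

sudo cgcreate -g cpuset:/proxy
sudo cgset -r cpuset.cpus=2-3 proxy
sudo cgset -r cpuset.mems=0 proxy

# Launch a component inside its cpuset
sudo cgexec -g cpuset:proxy ./http-tunnel
```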

### Prometheus

-`perf-gauge` can emit metrics to `Prometheus.` To launch a stack, you can use https://github.com/xnuter/prom-stack
+`perf-gauge` can emit metrics to `Prometheus`. To launch a stack, you can use https://github.com/xnuter/prom-stack.
I just forked `prom-stack` and removed everything but `prometheus`, `push-gateway`, and `grafana`. You can clone the stack and launch `make`.
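
For instance:

```
git clone https://github.com/xnuter/prom-stack.git
cd prom-stack
make
```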

Then set the variable with the host, for instance:
@@ -118,26 +118,26 @@ export PROMETHEUS_HOST=10.138.0.2
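
`perf-gauge` can then push metrics to the gateway. A sketch, assuming the `--prometheus` option described in `perf-gauge`'s README and the default `pushgateway` port (verify against `perf-gauge --help`; the remaining arguments are elided):

```
perf-gauge --prometheus $PROMETHEUS_HOST:9091 ...
```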

### Configurations

-Please note, that for all configuration we disable logging to minimize the number of variables, and the level of noise.
+Please note that we disable logging for all configurations to minimize the number of variables and the level of noise.

* [Perf-gauge](./perf-gauge-setup.md)
* [Nginx](nginx-config.md)
* TCP Proxies
* [HAProxy - C](haproxy-config.md)
* [draft-http-tunnel - C++](cpp-config.md)
* [http-tunnel - Rust](rust-config.md)
-  * [tcp-roxy - Golang](golang-config.md)
+  * [tcp-proxy - Golang](golang-config.md)
* [NetCrusher - Java](java-config.md)
* [pproxy - Python](python-config.md)

## Benchmarks

-Okay, we finally got to benchmark results. All benchmark results are split to two batches:
+Okay, we finally got to the benchmark results. They are split into two batches:

* Baseline, C, C++, Rust - comparing high-performance solutions
* Rust, Golang, Java, Python - comparing memory-safe languages

Yep, Rust belongs to both worlds.

* [Moderate RPS](./moderate-tps.md)
* [Max RPS](./max-tps.md)
2 changes: 1 addition & 1 deletion examples/golang-config.md
@@ -1,4 +1,4 @@
-### Running tpc-proxy (Golang)
+### Running TCP-proxy (Golang)

Repository: https://github.com/ickerwx/tcpproxy/

2 changes: 1 addition & 1 deletion examples/haproxy-config.md
@@ -1,6 +1,6 @@
### Setting up HAProxy

-We need to specify TCP frontend and backend. It's important to turn off logging, otherwise it would flood the disk.
+We need to specify a TCP frontend and backend. It's important to turn off logging. Otherwise, it would flood the disk.
Also, it should only use cores #2 and #3:

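The repository's actual config is elided in this truncated view; a minimal sketch of such an HAProxy config might look like the following (addresses, ports, and timeouts are placeholders; `cpu-map` pins the two worker threads to cores 2-3):

```
global
    maxconn 40000
    nbthread 2
    cpu-map auto:1/1-2 2-3
    # no `log` directive - logging stays off

defaults
    mode tcp
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend tcp_in
    bind 0.0.0.0:8999
    default_backend nginx_upstream

backend nginx_upstream
    server nginx1 127.0.0.1:80
```
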
66 changes: 32 additions & 34 deletions examples/max-tps.md
@@ -1,32 +1,30 @@
- [Max TPS](#max-tps)
-  * [High-performance (C, C++, Rust)](#high-performance--c--c----rust-)
+  * [High-performance (C, C++, Rust)](#high-performance-c-c-rust)
+ [Maximum rate achieved](#maximum-rate-achieved)
-    + [Regular percentiles (p50,90,99)](#regular-percentiles--p50-90-99-)
-    + [Tail latency (p99.9 and p99.99)](#tail-latency--p999-and-p9999-)
+    + [Regular percentiles (p50,90,99)](#regular-percentiles-p509099)
+    + [Tail latency (p99.9 and p99.99)](#tail-latency-p999-and-p9999)
+ [Trimmed mean and standard deviation](#trimmed-mean-and-standard-deviation)
+ [CPU consumption](#cpu-consumption)
+ [Summary](#summary)
-  * [Memory-safe languages (Rust, Golang, Java, Python)](#memory-safe-languages--rust--golang--java--python-)
+  * [Memory-safe languages (Rust, Golang, Java, Python)](#memory-safe-languages-rust-golang-java-python)
+ [Maximum rate achieved](#maximum-rate-achieved-1)
-    + [Regular percentiles (p50,90,99)](#regular-percentiles--p50-90-99--1)
-    + [Tail latency (p99.9 and p99.99)](#tail-latency--p999-and-p9999--1)
+    + [Regular percentiles (p50,90,99)](#regular-percentiles-p509099-1)
+    + [Tail latency (p99.9 and p99.99)](#tail-latency-p999-and-p9999-1)
+ [Trimmed mean and standard deviation](#trimmed-mean-and-standard-deviation-1)
+ [CPU consumption](#cpu-consumption-1)
+ [Summary](#summary-1)
* [Total summary](#total-summary)
* [Conclusion](#conclusion)

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>

### Max TPS

-The load is generated without any rate limiting with the concurrency setting `100`.
+The load is generated without any rate-limiting, with the concurrency setting `100`.

#### High-performance (C, C++, Rust)

##### Maximum rate achieved

-The most interesting question, is how much RPS each solution can handle?
+The most interesting question is: how much RPS can each solution handle?

While Nginx is capable of handling `~60k` requests per second (impressive for just two cores!),
all three C/C++/Rust are somewhat comparable (but C++ handled slightly more requests):
@@ -35,34 +33,34 @@
* C++ - 48.8k
* Rust - 46k

-![](./prom/max-baseline-c-cpp-rust-rps.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-rps.png)

##### Regular percentiles (p50,90,99)

The results are somewhat mixed again. While C++ showed a better `p50`, its `p99` is worse.
-At the `p90` level all three are close:
+At the `p90` level, all three are close:

-![](./prom/max-baseline-c-cpp-rust-p50-99.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-p50-99.png)

##### Tail latency (p99.9 and p99.99)

For the tail latency, Rust is better than both C and C++:

-![](./prom/max-baseline-c-cpp-rust-tail.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-tail.png)

##### Trimmed mean and standard deviation

-All three are nearly identical, however C++ is a tiny bit better (see the table below for the numbers):
+All three are nearly identical. However, C++ is a tiny bit better (see the table below for the numbers):

-![](./prom/max-baseline-c-cpp-rust-mean.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-mean.png)

##### CPU consumption

-CPU utilization is important here. What we want, is to saturate the CPU as much as we can.
+CPU utilization is important here. What we want is to saturate the CPU as much as we can.

-Baseline CPU Utilization is 73%, but in fact it is 93% of available cores (as cores 2 and 3 were not used).
+Baseline CPU Utilization is 73%, but in fact, it is 93% of available cores (as cores 2 and 3 were not used).

-![](./prom/max-baseline-c-cpp-rust-cpu.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-baseline-c-cpp-rust-cpu.png)

| | CPU Utilization |
|---|---|
@@ -71,8 +69,8 @@
|C++ |96%|
|Rust |93%|

-Which means that C++ managed to use more CPU and spent more time handling requests.
-However, it's worth mentioning that the `draft-http-tunnel` is implemented using callbacks, while the Rust solution is based on `tokio`,
+This means that C++ managed to use more CPU and spent more time handling requests.
+However, it's worth mentioning that the `draft-http-tunnel` is implemented using callbacks, while the Rust solution is based on `tokio`,
which is a feature-rich framework and is much more flexible and extendable.

##### Summary
@@ -88,45 +86,45 @@

##### Maximum rate achieved

-Among memory safe both Rust and Golang showed comparable throughput, while Java and Python were significantly behind:
+Among the memory-safe languages, Rust and Golang showed comparable throughput, while Java and Python were significantly behind:

* Rust - 46k
* Golang - 42.6k
* Java - 25.9k
* Python - 18.3k

-![](./prom/max-rust-golang-java-python-rps.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-rps.png)

##### Regular percentiles (p50,90,99)

-Again, we can see, that at `p50`-`p90` level Golang is somewhat comparable to Rust,
+Again, we can see that at the `p50`-`p90` level, Golang is somewhat comparable to Rust,
but quickly deviates at `p99` level, adding almost two milliseconds.

Java and Python exhibit substantially higher latencies, but Java's `p99` latency is much worse than Python's:

-![](./prom/max-rust-golang-java-python-p50-99.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-p50-99.png)

##### Tail latency (p99.9 and p99.99)

-Tail latency shows even larger difference with Rust, and for Java is the worst of all four:
+Tail latency shows an even larger gap from Rust, and Java's is the worst of all four:

-![](./prom/max-rust-golang-java-python-tail.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-tail.png)

##### Trimmed mean and standard deviation

Golang's trimmed mean is comparable to Rust's, which is impressive.
Again, Java and Python are well behind both Rust and Golang:

-![](./prom/max-rust-golang-java-python-mean.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-mean.png)

##### CPU consumption

-CPU utilization is important here. What we want, is to saturate the CPU as much as we can.
+CPU utilization is important here. What we want is to saturate the CPU as much as we can.

-As we can see, Rust does the best job of utilizing resources while Golang, Java and Python (in this order)
+As we can see, Rust does the best job of utilizing resources, while Golang, Java, and Python (in this order)
allow more power to stay idle.

-![](./prom/max-rust-golang-java-python-cpu.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-cpu.png)

| | CPU Utilization |
|---|---|
@@ -159,13 +157,13 @@
#### Conclusion

The Rust solution is on par with C/C++ solutions at all levels.
-Golang is slightly worse, especially for tail latencies, but is close to high performance languages.
+Golang is slightly worse, especially for tail latencies, but is close to the high-performance languages.

NetCrusher and pproxy have much worse throughput and latency characteristics if a network service is under heavy load.
Moreover, NetCrusher (Java) showed the worst max latency, measured in seconds:

-![](./prom/max-rust-golang-java-python-max.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/max-rust-golang-java-python-max.png)

BTW, try to spot Java on the memory consumption graph:

-![](./prom/java-vs-others-memory.png)
+![](https://raw.githubusercontent.com/xnuter/perf-gauge/main/examples/prom/java-vs-others-memory.png)
