title: 'TLS termination: stunnel, nginx & stud'
uuid: d001e3f6-f70b-4b80-ab64-718c5c04bfe1
- ssl
Here is the short version: to get better performance on your TLS
terminator, use [stud][stud] on a 64-bit system with the patch from
[Émeric Brun for TLS session reuse][studsslreuse], with some AES cipher
suite (128 or 256, it does not really matter), without DHE, on as many
cores as needed, and with a key size of 1024 bits unless more is needed.
!!! "Update (2011.10)" I received some constructive comments about
these tests. After reading this article, I invite you to look at
additional benchmarks in the [second round][next].
!!! "Update (2015.02)" While the content of this article is still
technically sound, keep in mind that it was written at the end of
2011 and therefore doesn't take into account many important aspects,
like the fall of RC4 as an appropriate cipher.
!!! "Update (2016.01)" *stud* is now unmaintained. It has been forked
by the Varnish team and is now called
# Introduction
One year ago, Adam Langley, from Google,
[stated TLS was not computationally expensive any more][ssltips]:
> In January this year (2010), Gmail switched to using HTTPS for
> everything by default. Previously it had been introduced as an option,
> but now all of our users use HTTPS to secure their email between their
> browsers and Google, all the time. In order to do this we had to
> deploy no additional machines and no special hardware. On our
> production frontend machines, SSL/TLS accounts for less than 1% of the
> CPU load, less than 10KB of memory per connection and less than 2% of
> network overhead. Many people believe that SSL takes a lot of CPU time
> and we hope the above numbers (public for the first time) will help to
> dispel that.
> If you stop reading now you only need to remember one thing: SSL/TLS
> is not computationally expensive any more.
This is a very informative post containing tips on how to enhance TLS
performance by reducing latency. However, unlike Gmail, you may still
be worried about raw TLS performance. Maybe each of your frontends is
able to serve 2000 requests per second and the CPU overhead of TLS is
therefore significant. Maybe you want to terminate TLS in your load
balancer (even if you know
[this is not the best way to scale][nginxcaching]).
# Tuning TLS
There are a lot of knobs you can use to get better performance from
TLS: choosing the best **implementation**, using more **CPU cores**,
switching to a **64-bit** system, choosing the right **cipher suite** and
the appropriate **key size**, and enabling a **session cache**.
We will consider three TLS terminators. They all use OpenSSL under
the hood. [stunnel][] is the oldest one and uses a threaded
model. [stud][] is a recent attempt to write a simple TLS
terminator that is efficient and scalable. It uses the
one-process-per-core model. [nginx][] is a web server that
can be used as a reverse proxy and can therefore act as a TLS terminator. It
is known to be one of the most efficient web servers, hence the choice
here. It also features basic built-in load balancing. Since *stud* <del>and
*stunnel*</del>[^0] do not have this feature, we use them with
[HAProxy][haproxy], a high-performance load balancer that usually
defers the TLS part to *stunnel* (but *stud* can act as a drop-in
replacement here).
[^0]: In fact, *stunnel* also has simple round-robin load balancing by
specifying `connect` multiple times in the configuration.
These implementations were already tested by
[Matt Stancliff][matt]. He first concluded that
[nginx sucked at TLS][matt1], then that
[nginx did not suck at TLS][matt2] (in the first case, *nginx* was the
only one to select a DHE cipher suite). The details were
scarce; I hope to provide more of them here.
This brings us to the importance of the selected cipher suite. The client
and the server must agree on the cipher suite to use for symmetric
encryption. The server will select the strongest one it supports from
the list proposed by the client. If you enable some expensive cipher
suite on your server, it is likely to be selected.
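To see which expensive suites a given cipher string would enable, one option is to expand it with `openssl ciphers -v` and filter on the key-exchange column. The following is a minimal sketch: `dh_suites` is a hypothetical helper name, and it assumes the `Kx=DH` marker that `openssl ciphers -v` prints for DHE (and anonymous DH) suites.

```shell
# dh_suites: read `openssl ciphers -v` output on stdin and print the
# names of the suites whose key exchange column (third field) is Kx=DH.
dh_suites() {
    awk '$3 == "Kx=DH" { print $1 }'
}

# Hypothetical usage (requires the openssl command-line tool):
#   openssl ciphers -v 'ALL' | dh_suites
```

An empty result means the cipher string enables no DHE suite, so the server cannot end up selecting one.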
The last important knob is TLS session reuse. It matters for both
performance and latency. During the first handshake, the server
sends a *session ID*. The client can use this session ID to request an
*abbreviated handshake*. This handshake is shorter (latency
improvement by one round-trip) and allows us to reuse the previously
negotiated master secret (performance improvement by skipping the
costliest part of the handshake). See the
[article from Adam Langley][ssltips] for additional information (with
helpful drawings).
The following table shows which cipher suite is selected by some major
web sites. I have also indicated whether they support session
reuse. See this
[post from Matt Palmer][nginxcaching] to learn how to check this.
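One way to run this kind of check from the command line is `openssl s_client -reconnect`, which performs a full handshake and then reconnects several times with the same session. The sketch below only summarises that output. `summarize_handshakes` and `check_session_reuse` are hypothetical helper names, and the parsing assumes the `New, ...` and `Reused, ...` status lines that `s_client` prints for TLS 1.2 and earlier.

```shell
# summarize_handshakes: read `openssl s_client -reconnect` output on
# stdin, count full ("New") and resumed ("Reused") handshakes and
# report the cipher suite of the first handshake seen.
summarize_handshakes() {
    awk '
        /^New, /    { new++ }
        /^Reused, / { reused++ }
        /^(New|Reused), .*Cipher is / && cipher == "" { cipher = $NF }
        END { printf "cipher=%s new=%d reused=%d\n", cipher, new, reused }
    '
}

# check_session_reuse HOST: hypothetical wrapper around s_client.
# -no_ticket restricts the check to session IDs (no RFC 5077 tickets).
check_session_reuse() {
    openssl s_client -connect "$1:443" -reconnect -no_ticket \
        </dev/null 2>/dev/null | summarize_handshakes
}
```

A site that supports session reuse should report one new handshake followed by several reused ones; a site that does not will report only new handshakes.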
Site                        | Cipher             | Session reuse?
--------------------------- | ------------------ | ---------------
<i></i>                     | RC4-SHA            | ✓
<i></i>                     | RC4-MD5            | ✓
<i></i>                     | AES256-SHA         | ✕
<i></i>                     | RC4-MD5            | ✕
<i></i>                     | AES256-SHA         | ✕
<i></i>                     | DHE-RSA-AES256-SHA | ✓
[RFC 5077][rfc5077] defines another mechanism for session resumption
that does not require any server-side state. You can find additional
information about this optional mechanism in my article about
[enabling session reuse][reuse].
[reuse]: [[en/blog/2011-ssl-session-reuse-rfc5077.html]]
# Benchmarks
All these benchmarks were run on an HP DL 380 G7, with two [Xeon
L5630][l5630] (running at 2.13GHz for a total of 8 cores), without
hyperthreading, using a 2.6.39 kernel (`HZ` is set to 250) and two
Intel 82576 NICs. The Linux conntrack subsystem has been disabled and file
limits have been raised above 100,000.
A [Spirent Avalanche 2900][avalanche] appliance is used to run the
benchmarks.
We are interested in raw TLS performance in terms of handshakes per
second. Because we want to observe the effects of session reuse, the
scenario run by most tests has each client make four successive
requests, reusing the TLS session for three of them. There is no
HTTP keepalive. There is no compression. The size of the HTTP answer
from the server is 1024 bytes. These choices are made because our
primary target is the number of TLS handshakes per second.
It should also be noted that HAProxy alone is able to handle 22,000
TPS on one core. During the tests, it was never the bottleneck.
## Implementation
We run a first bench to compare *nginx*, *stud* and *stunnel* on one
core. This bench runs on a 32-bit system, with the AES128-SHA1 cipher suite
and a 1024-bit key. Here is the result for *stud* (version 0.1,
with [Émeric Brun's patch for TLS session reuse][studsslreuse]):
![stud, 1 CPU][s1]
[s1]: [[!!images/benchs-ssl/stud-1cpu.png]] "Performances of stud on 1 core"
The most important plot is the top one. The blue line is the attempted
number of transactions per second (TPS) while the green one is the
successful number of TPS. When the two lines start to diverge, this
means we have reached some kind of maximum: the number of unsuccessful TPS
also starts to rise (the red line).
There are several noticeable points: the maximum TPS (788 TPS), the
maximum TPS with an average response time less than 100 ms (766 TPS),
less than 500 ms (776 TPS) and the maximum TPS with less than 0.01% of
packet loss (783 TPS).
Let's have a look at *nginx* (version 1.0.5):
![nginx, 1 CPU][s0]
[s0]: [[!!images/benchs-ssl/nginx-1cpu.png]] "Performances of nginx on 1 core"
It is able to reach the same performance as *stud* (763 TPS). However,
above 512 TPS, you get more than 100 ms of response time. Above 556 TPS,
you even get more than 500 ms! I get this behaviour in every bench
with *nginx* (with or without `proxy_buffering`, with or without
`tcp_nopush`, with or without `tcp_nodelay`) and I am unable to
explain it. Maybe there is
[something wrong in the configuration][nginxconf]. After hitting this
limit, *nginx* starts to process connections in bursts. Therefore, the
number of successful TPS is plotted using a dotted line while the
moving average over 20 seconds is plotted with a solid line.
The next plot compares the performance of the three
implementations. On this kind of plot, the number of TPS that we keep
is the maximum number of TPS where loss is less than 0.1% and average
response time is less than 100 ms. *stud* achieves a performance of
766 TPS while *nginx* and *stunnel* are just above 500 TPS.
![stunnel vs stud vs nginx, 1 CPU][s2]
[s2]: [[!!images/benchs-ssl/stunnel-stud-nginx-1cpu.png]] "Compared performances of stunnel, stud and nginx on 1 core"
## Number of cores
With multiple cores, we can assign several of them to the hard work
of TLS. To get better performance, we pin each process to a dedicated
core. Here is the allocation used during the tests:
<i></i> | 1 core | 2 cores | 4 cores | 6 cores
------------- | ------- | ------- | ------- | -----------------
CPU 1, Core 1 | network | network | network | network + haproxy
CPU 1, Core 2 | haproxy | haproxy | haproxy | TLS
CPU 1, Core 3 | TLS | TLS | TLS | TLS
CPU 1, Core 4 | - | TLS | TLS | TLS
CPU 2, Core 5 | - | - | network | TLS
CPU 2, Core 6 | - | - | TLS | TLS
CPU 2, Core 7 | - | - | TLS | TLS
CPU 2, Core 8 | system | system | system | system + haproxy
Remember, we have two CPUs with four cores each. Cores on the same CPU
share the same L3 cache (on this model). Therefore, the arrangement
inside a CPU is not really important. When possible, we try to keep
things together on the same CPU. The TLS processes always get
exclusive use of a core since they will always be the busiest. The
allocation is done with [cpuset(7)][cpuset] for userland processes
and by setting `smp_affinity` for network card interrupts.
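The `smp_affinity` file of an IRQ takes a hexadecimal bitmask in which bit *N* selects kernel processor *N*. Here is a minimal sketch of how such a mask can be computed. `cpu_mask` is a hypothetical helper, the IRQ number and PID in the comments are placeholders, and `taskset` is shown only as a simpler stand-in for the cpuset(7) setup used in these tests.

```shell
# cpu_mask N: print the hexadecimal affinity mask that selects only
# kernel processor N (bit N set), the format smp_affinity expects.
cpu_mask() {
    printf '%x\n' "$((1 << $1))"
}

# Hypothetical usage, as root (IRQ 42 and PID 1234 are placeholders):
#   cpu_mask 0 > /proc/irq/42/smp_affinity   # NIC interrupts on processor 0
#   taskset -p "$(cpu_mask 2)" 1234          # pin PID 1234 to processor 2
```

For example, processor 4 corresponds to the mask `10` and processor 7 to `80`.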
We keep one core for the system. It is quite important to be able to
connect and monitor the system correctly even when it is loaded. With
so many cores, we can afford to reserve one core for this usage.
Beware! There is a trap when pinning processes or IRQs to a core. You
need to check `/proc/cpuinfo` to discover the mapping between kernel
processors and physical cores. Sometimes, the second kernel processor can
be the first core of the second physical processor (instead of the
second core of the first physical processor).
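A minimal sketch of such a check is shown below. `cpu_topology` is a hypothetical helper, and it assumes the `processor`, `physical id` and `core id` fields that `/proc/cpuinfo` exposes on x86 Linux, in that order.

```shell
# cpu_topology: read /proc/cpuinfo-formatted text on stdin and print,
# for each kernel processor, its physical package and core ID.
cpu_topology() {
    awk -F ':' '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim field name
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim field value
        }
        key == "processor"   { cpu = val }
        key == "physical id" { pkg = val }
        key == "core id"     {
            printf "processor %s: package %s, core %s\n", cpu, pkg, val
        }
    '
}

# Usage:
#   cpu_topology < /proc/cpuinfo
```

If consecutive processor numbers alternate between packages in the output, pinning "processor 1" does not place a process next to "processor 0" on the same physical CPU.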
![stunnel vs stud vs nginx, 6 CPU][s3]
[s3]: [[!!images/benchs-ssl/scalability-6cores.png]] "stunnel, stud and nginx performances on 6 cores"
As you can see in the plot above, only *stud* is able to scale
properly. *stunnel* is not able to take advantage of the additional
cores and performs worse than with a single core. I think this may be
due to its threaded model and the fact that the userland for this
32-bit system is a bit old. *nginx* is able to achieve the same TPS
as *stud* but is disqualified because of the increased latency. We
won't use *stunnel* in the remaining tests.
## 32-bit vs 64-bit
The easiest way to achieve a performance boost is to switch to a 64-bit
system: TPS are doubled. Our 64-bit system also uses a version of
OpenSSL with support for AES-NI, a CPU extension that speeds up AES
encryption and decryption, so AES-NI is enabled. The remaining tests
are done on a 64-bit system.
![stud vs nginx, 64-bit][s4]
[s4]: [[!!images/benchs-ssl/64bit.png]] "Performances of stud and nginx on a 64-bit system"
## Ciphers and key sizes
In our tests, the influence of ciphers is minimal. There is almost no
difference between AES256 and AES128. Using RC4 adds some latency. The
use of AES-NI may have helped to avoid this latency for AES.
On the other hand, using a 2048-bit key incurs a huge performance
hit: TPS are divided by 5.
![stud: cipher suites and key sizes][s5]
[s5]: [[!!images/benchs-ssl/ciphers.png]] "Influence of cipher suites and key sizes"
## Session cache
The last thing to check is the influence of TLS session reuse. Since
we are on a local network, we see only one of the positive effects of
session reuse: better TPS because of reduced CPU usage. With
significant network latency, we should also have seen better
TPS thanks to the removed round-trip.
![stud: session reuse][s6]
[s6]: [[!!images/benchs-ssl/session-reuse.png]] "Influence of session reuse"
This plot also explains why the performance of *stud* falls after the
maximum: because of the failed transactions, the session cache is not
as efficient. This phenomenon does not exist when session reuse is not
enabled.
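As a rough illustration of the CPU effect, the *stud* figures from the summary table in the conclusion can be fed into a simple linear cost model. This is a back-of-the-envelope sketch, not something measured in these benchmarks: it assumes that the CPU time per transaction grows linearly with the fraction of transactions needing a full handshake, and `predict_tps` is a hypothetical helper.

```shell
# predict_tps TPS1 F1 TPS2 F2 F3
# Fit cost(f) = a + b*f, the CPU time per transaction when a fraction f
# of transactions need a full handshake, through two measurements
# (TPS1 at F1, TPS2 at F2), then print the TPS predicted for F3.
predict_tps() {
    awk -v t1="$1" -v f1="$2" -v t2="$3" -v f2="$4" -v f3="$5" 'BEGIN {
        b = (1 / t1 - 1 / t2) / (f1 - f2)   # extra cost of a full handshake
        a = 1 / t1 - b * f1                 # cost with an abbreviated one
        printf "%.0f\n", 1 / (a + b * f3)
    }'
}

# Inputs: no session reuse (f = 1, 5844 TPS) and 4 requests per
# session (f = 1/4, 9000 TPS). Target: 80 requests per session (f = 1/80).
predict_tps 5844 1 9000 0.25 0.0125
```

Under these assumptions the model lands in the region of 10,850 TPS, close to the 10797 TPS measured with 80 requests per TLS session, which is consistent with the full handshake being the dominant per-transaction cost.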
# Conclusion
Here is a summary of the maximum TPS reached during the benchmark
(with an average response time below 100 ms). Our control use case is
the following: 64-bit, 6 cores, AES128, SHA1, 1024-bit key, 4 requests
in the same TLS session.
Context | nginx 1.0.5 | stunnel 4.41 | stud 0.1 ([patched][studsslreuse])
--------------------------- | ----------- | ------------- | ----------------------------------
1 core, 32-bit | 512 TPS | 503 TPS | 766 TPS
2 cores, 32-bit | 599 TPS | - | -
32-bit | 804 TPS | 501 TPS | 4251 TPS
- | **1799 TPS**| - | **9000 TPS**
AES256 | - | - | 8880 TPS
RC4-MD5 | - | - | 7370 TPS
2048bit | - | - | 1643 TPS
no session reuse | - | - | 5844 TPS
80 requests per TLS session | - | - | 10797 TPS
Therefore, *stud* in our control scenario is able to sustain 1500 TPS
per core. It seems to be the best current option for TLS
termination. It is not available in Debian yet, but I
[intend to package it][itp].
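The per-core figure and the main ratios can be recomputed from the *stud* column of the table. The sketch below assumes that the 9000 TPS row is the control case and that each of the following rows changes a single parameter from it; `ratio` is a hypothetical helper.

```shell
# ratio A B: print A / B with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 9000 6      # TPS per core with 6 TLS cores        -> 1500.00
ratio 9000 1643   # slowdown factor with a 2048-bit key  -> 5.48
ratio 9000 5844   # gain from reuse, 4 requests/session  -> 1.54
```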
Here is how *stud* is invoked:

    # ulimit -n 100000
    # stud -n 2 -f,443 -b,80 -c ALL \
        -B 1000 -C 20000 --write-proxy \
        =(cat server1024.crt server1024.key dhe1024)
You may also want to look at [HAProxy configuration][haproxyconf],
[nginx configuration][nginxconf] and
[stunnel configuration][stunnelconf].
!!! "Update (2011.10)" More benchmarks are available in the [second
part][next]. Keep reading!
[next]: [[en/blog/2011-ssl-benchmark-round2.html]]
*[TPS]: Transactions per second
*[AES]: Advanced Encryption Standard
*[DHE]: Diffie Hellman Ephemeral
*[SSL]: Secure Sockets Layer
*[TLS]: Transport Layer Security
{# Local Variables: #}
{# mode: markdown #}
{# End: #}