<h1 id="warp">Warp</h1>
<p>Authors: Michael Snoyman and Kazu Yamamoto</p>
<p>Warp is a high-performance HTTP server library written in Haskell, a purely functional programming language. Both Yesod, a web application framework, and Mighttpd, an HTTP server, are implemented on top of Warp. According to our throughput benchmark, Mighttpd provides performance on par with nginx. This article explains the architecture of Warp and how we improved its performance.</p>
<h2 id="network-programming-in-haskell">Network programming in Haskell</h2>
<p>Some people may still believe that functional programming languages are slow or impractical. However, to the best of our knowledge, Haskell provides one of the best foundations for network programming. This is because GHC (the Glasgow Haskell Compiler), the flagship compiler for Haskell, provides lightweight and robust user threads (sometimes called green threads). In this section, we briefly review the history of server-side network programming.</p>
<h3 id="native-threads">Native threads</h3>
<p>Traditional servers use a technique called thread programming. In this architecture, each connection is handled by a single process or native thread.</p>
<p>This architecture can be further classified by how processes or native threads are created. When using a thread pool, multiple processes or native threads are created in advance. An example of this is the prefork mode in Apache. Otherwise, a process or native thread is spawned each time a connection is received. Fig XXX illustrates this.</p>
<div class="figure">
<img src="1.png" alt="Native threads" /><p class="caption">Native threads</p>
</div>
<p>The advantage of this architecture is that it allows clear code, because the code is not divided into event handlers. Also, because the kernel assigns processes or native threads to available cores, we can balance utilization of cores. Its disadvantage is that a large number of context switches between the kernel and processes or native threads occur, so performance suffers.</p>
<h3 id="event-driven">Event driven</h3>
<p>In recent years it has often been said that event-driven programming is required to implement high-performance servers. In this architecture multiple connections are handled by a single process (Fig XXX). Lighttpd is an example of a web server using this architecture.</p>
<div class="figure">
<img src="2.png" alt="Event driven" /><p class="caption">Event driven</p>
</div>
<p>Since there is no need to switch processes, fewer context switches occur and performance improves. This is its chief advantage. However, it has two shortcomings. The first is that only one core can be utilized because there is only a single process. The second is that it requires asynchronous programming, so code is fragmented into event handlers. Asynchronous programming also prevents the conventional use of exception handling (although C has no exceptions to begin with).</p>
<h3 id="process-per-core">1 process per core</h3>
<p>Many have hit upon the idea of creating N event-driven processes to utilize N cores (Fig XXX). For web servers, port 80 must be shared among the processes. Using the prefork technique (not to be confused with Apache's prefork mode), port sharing can be achieved with only a slight modification to the code.</p>
<div class="figure">
<img src="3.png" alt="1 process per core" /><p class="caption">1 process per core</p>
</div>
<p>One web server that uses this architecture is nginx. Node.js originally used the purely event-driven architecture, but it has recently implemented this scheme as well.</p>
<p>The advantage of this architecture is that it utilizes all cores and improves performance. However, it does not resolve the issue of programs having poor clarity.</p>
<h3 id="user-threads">User threads</h3>
<p>GHC's user threads can be used to solve the code clarity issue. They are implemented on top of an event-driven IO manager. Haskell's standard libraries use non-blocking system calls so that they can cooperate with the IO manager. GHC's user threads are lightweight: modern computers can run 100,000 user threads smoothly. They are also robust: even asynchronous exceptions can be caught (we explain this in detail later).</p>
<p>Some languages and libraries have provided user threads in the past, but they are not commonly used now because they were not lightweight or not robust enough. In Haskell, however, most computation is non-destructive, which means that almost all functions are thread-safe. GHC uses data allocation as a safe point for context switching between user threads. Because of the functional programming style, new data is created frequently, and it is known that <a href="http://www.aosabook.org/en/ghc.html">such data allocation is regular enough for context switching</a>.</p>
<p>The use of lightweight threads makes it possible to write code with good clarity, like traditional thread programming, while maintaining high performance (Fig XXX).</p>
<div class="figure">
<img src="4.png" alt="User threads" /><p class="caption">User threads</p>
</div>
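<p>A thread-per-connection server in this style can be sketched in a few lines. The following fragment is illustrative only, not Warp's actual code; the <code>serve</code> helper and its arguments are our own invention:</p>
<pre><code>import Control.Concurrent (forkIO)
import Control.Monad (forever, void)
import Network.Socket (Socket, accept)

-- Each accepted connection gets its own lightweight user thread,
-- so the per-connection logic `talk` can be plain, blocking-style code.
serve :: Socket -> (Socket -> IO ()) -> IO ()
serve sock talk = forever $ do
    (conn, _peer) &lt;- accept sock
    void $ forkIO (talk conn)</code></pre>
<p>Even with tens of thousands of concurrent connections, scheduling cost stays low, because these are user threads multiplexed by GHC's IO manager, not native threads.</p>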
<h2 id="warps-architecture">Warp's architecture</h2>
<p>Warp is an HTTP engine for WAI (the Web Application Interface). It runs WAI applications over HTTP. As we described above, both Yesod and Mighttpd are examples of WAI applications, as illustrated in Fig XXX.</p>
<div class="figure">
<img src="wai.png" alt="WAI" /><p class="caption">WAI</p>
</div>
<p>The type of WAI applications is as follows:</p>
<pre><code>type Application = Request -> ResourceT IO Response</code></pre>
<p>In Haskell, the argument types of a function are separated by right arrows, and the rightmost type is the return value. So we can interpret the definition as: an application takes a <code>Request</code> and returns a <code>Response</code>.</p>
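<p>For example, a minimal WAI application of this type could look like the following sketch (assuming the <code>wai</code> and <code>http-types</code> packages of this WAI generation; the name <code>app</code> is ours):</p>
<pre><code>{-# LANGUAGE OverloadedStrings #-}
import Network.Wai (Application, responseLBS)
import Network.HTTP.Types (status200)

-- This application ignores the Request and always returns
-- the same plain-text Response.
app :: Application
app _request = return $
    responseLBS status200 [("Content-Type", "text/plain")] "Hello, Warp!"</code></pre>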
<p>After accepting a new HTTP connection, a dedicated user thread is spawned for the connection. It first receives an HTTP request from a client and parses it into a <code>Request</code>. Then Warp gives the <code>Request</code> to the application and receives a <code>Response</code> from it. Finally, Warp builds an HTTP response based on the <code>Response</code> value and sends it back to the client. This is illustrated in Fig XXX.</p>
<div class="figure">
<img src="warp.png" alt="Warp" /><p class="caption">Warp</p>
</div>
<p>The user thread repeats this procedure if necessary and terminates by itself when the connection is closed by the peer.</p>
<h2 id="performance-of-warp">Performance of Warp</h2>
<p>Before we explain how we improved the performance of Warp, we would like to show the results of our benchmark. We measured the throughput of Mighttpd 2.8.2 (with Warp x.x.x) and nginx 1.2.4. Our benchmark environment is as follows:</p>
<ul>
<li>One 12-core machine (Intel Xeon E5645, two sockets, 6 cores per CPU, two QPI links between the two CPUs)</li>
<li>Linux version 3.2.0 (Ubuntu 12.04 LTS), which is running directly on the machine (i.e. without a hypervisor)</li>
</ul>
<p>We tested several benchmark tools in the past and our favorite was <code>httperf</code>. Since it uses the <code>select()</code> system call and is a single-process program, it reaches its own performance limits when measuring HTTP servers on multi-core machines. So we switched to <code>weighttp</code>, which is based on the <code>epoll</code> system call family and can use multiple native threads. We used <code>weighttp</code> as follows:</p>
<pre><code>weighttp -n 100000 -c 1000 -t 3 -k http://127.0.0.1:8000/</code></pre>
<p>This means that 1,000 HTTP connections are established, and each connection sends 100 requests. 3 native threads are spawned to carry out these jobs.</p>
<p>For all requests, the same <code>index.html</code> file is returned. We used <code>nginx</code>'s <code>index.html</code>, whose size is 151 bytes. As "127.0.0.1" suggests, we measured the web servers locally. We should have measured from a remote machine, but we do not have a suitable environment at this moment. (NOTE: I'm planning to do a benchmark using two machines soon.)</p>
<p>Since Linux has many control parameters, we needed to configure them carefully. You can find a good introduction to Linux parameter tuning in <a href="http://gwan.com/en_apachebench_httperf.html">ApacheBench &amp; HTTPerf</a>.</p>
<p>We carefully configured both Mighty and <code>nginx</code> as follows:</p>
<ul>
<li>using file descriptor cache</li>
<li>no logging</li>
<li>no rate limitation</li>
</ul>
<p>Since our machine has 12 cores and <code>weighttp</code> uses three native threads, we measured the web servers with one to eight workers (in our experience, three native threads are enough to measure eight workers). Here is the result:</p>
<div class="figure">
<img src="multi-workers.png" alt="Performance of Warp and nginx" /><p class="caption">Performance of Warp and nginx</p>
</div>
<p>The x-axis is the number of workers and the y-axis is throughput, measured in requests per second.</p>
<h2 id="key-ideas">Key ideas</h2>
<p>There are three key ideas for implementing a high-performance server in Haskell:</p>
<ol style="list-style-type: decimal">
<li>Issuing as few system calls as possible</li>
<li>Specialization and avoiding re-calculation</li>
<li>Avoiding locks</li>
</ol>
<p>When a system call is issued, CPU time is given to the kernel and all user threads stop. So we need to issue as few system calls as possible. For an HTTP session serving a static file, Warp calls only <code>recv()</code>, <code>send()</code> and <code>sendfile()</code> (Fig warp.png). <code>open()</code>, <code>stat()</code>, <code>close()</code> and other system calls can be omitted thanks to the cache mechanism described later.</p>
<p>TBD</p>
<p>TBD</p>
<p>To make our explanation simple, we will talk about Linux only for the rest of this article.</p>
<h2 id="http-request-parser">HTTP request parser</h2>
<ul>
<li>Parser generator vs handmade parser</li>
<li>No timeout care thanks to timeout manager -- From "Warp: A Haskell Web Server"?</li>
<li>Conduit</li>
</ul>
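<p>To give a flavor of the handmade approach, the request line can be split on spaces with a few strict <code>ByteString</code> operations. This illustrative sketch (the function name is ours) skips validation and error handling:</p>
<pre><code>import qualified Data.ByteString.Char8 as B

-- Split "GET /index.html HTTP/1.1" into (method, path, version).
-- A real parser must validate each field and reject malformed input.
parseRequestLine :: B.ByteString -> (B.ByteString, B.ByteString, B.ByteString)
parseRequestLine line = (method, path, version)
  where
    (method, rest) = B.break (== ' ') line
    (path, rest')  = B.break (== ' ') (B.drop 1 rest)
    version        = B.drop 1 rest'</code></pre>
<p>Working directly on strict <code>ByteString</code>s like this avoids overhead that a general parser library may add, at the cost of more care with corner cases.</p>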
<h2 id="http-response-builder">HTTP response builder</h2>
<h3 id="response-header">response header</h3>
<ul>
<li>Blaze builder vs low level memory operations</li>
</ul>
<h3 id="response-body">response body</h3>
<ul>
<li>Three types</li>
<li>Blaze builder</li>
<li>Conduit</li>
<li>sendfile</li>
</ul>
<h3 id="sending-header-and-body-together">sending header and body together</h3>
<p>When we measured the performance of Warp, we always did so with high concurrency; that is, we always made multiple connections at the same time, and the results were good. However, when we set the concurrency to 1, we found that Warp was really slow.</p>
<p>We realized that this is because Warp used the combination of <code>writev()</code> for the header and <code>sendfile()</code> for the body. In this case, the HTTP header and body are sent in separate TCP packets (Fig xxx).</p>
<div class="figure">
<img src="tcpdump.png" alt="Packet sequence of old Warp" /><p class="caption">Packet sequence of old Warp</p>
</div>
<p>To send them in a single TCP packet (when possible), we switched from <code>writev()</code> to <code>send()</code>. We use the <code>send()</code> system call with the <code>MSG_MORE</code> flag to store a header and the <code>sendfile()</code> system call to send both the stored header and a file. This made the throughput at least 100 times faster.</p>
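<p>In Haskell, this sequence is available through the <code>simple-sendfile</code> package; the following sketch shows how a stored header and a file could be sent together (the wrapper function is ours, and we assume <code>sendfileWithHeader</code>'s signature as of this writing):</p>
<pre><code>import qualified Data.ByteString.Char8 as B
import Network.Sendfile (FileRange (EntireFile), sendfileWithHeader)
import Network.Socket (Socket)

-- The header bytes are queued with send() and MSG_MORE, then
-- sendfile() transmits them together with the beginning of the file,
-- so header and body can share a TCP packet.
sendFileResponse :: Socket -> [B.ByteString] -> FilePath -> IO ()
sendFileResponse sock header path =
    sendfileWithHeader sock path EntireFile (return ()) header</code></pre>
<p>The <code>IO ()</code> argument is a hook invoked during transmission; a Warp-like server can use it, for example, to tickle the connection's timeout.</p>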
<h2 id="clean-up-with-timers">Clean-up with timers</h2>
<h3 id="for-connections">For connections</h3>
<ul>
<li>Requirements</li>
<li>System.Timeout.timeout (does not scale because it needs one timeout thread per connection thread)</li>
<li>MVar (slow because a homebrew spin lock is used)</li>
<li>IORef</li>
<li>Its algorithm</li>
</ul>
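<p>The IORef-based algorithm can be sketched as follows. This is a simplified illustration of the idea, not Warp's implementation; all names are ours, and a real manager must also prune killed entries from its list:</p>
<pre><code>import Data.IORef

data State = Active | Inactive

-- A worker thread "tickles" its own state on every activity;
-- an IORef write needs no lock.
tickle :: IORef State -> IO ()
tickle ref = writeIORef ref Active

-- Once per timeout period, the manager thread turns Active into
-- Inactive and runs the kill action of connections that stayed
-- Inactive for the whole period.
sweep :: [(IORef State, IO ())] -> IO ()
sweep = mapM_ step
  where
    step (ref, kill) = do
        st &lt;- readIORef ref
        case st of
            Active   -> writeIORef ref Inactive
            Inactive -> kill</code></pre>
<p>A single manager thread calling <code>sweep</code> every period replaces one <code>System.Timeout.timeout</code> thread per connection, and the hot path, <code>tickle</code>, is a single lock-free <code>IORef</code> write.</p>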
<p>Need a fig</p>
<h3 id="for-file-descriptors">For file descriptors</h3>
<ul>
<li>Requirements -- no leakage -- lookup -- multimap -- fast pruning</li>
<li>Red black tree</li>
</ul>
<p>Need a fig</p>
<h2 id="logging-xxx-necessary">Logging (xxx necessary?)</h2>
<ul>
<li>Handle</li>
<li>From the Mighty article in Monad.Reader</li>
</ul>
<p>Need a fig</p>
<ul>
<li>date</li>
<li>avoiding gettimeofday()</li>
<li>caching the formatted date</li>
</ul>
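<p>The cached-date idea can be sketched with an <code>IORef</code> that a dedicated thread refreshes once per second, so request handlers read a prepared string instead of calling <code>gettimeofday()</code> and formatting the date for every response. This is a simplified illustration; Warp's actual date rendering differs:</p>
<pre><code>import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)
import Data.IORef
import Data.Time (getCurrentTime)
import Data.Time.Format (defaultTimeLocale, formatTime)

-- Render the value of an HTTP Date header, e.g. "Thu, 01 Jan 2024 00:00:00 GMT".
httpDate :: IO String
httpDate = fmap (formatTime defaultTimeLocale "%a, %d %b %Y %H:%M:%S GMT") getCurrentTime

-- Format the date once per second and cache it; the returned action
-- is what handlers call, and it is only an IORef read.
startDateCache :: IO (IO String)
startDateCache = do
    ref &lt;- httpDate >>= newIORef
    void $ forkIO $ forever $ do
        threadDelay 1000000
        httpDate >>= writeIORef ref
    return (readIORef ref)</code></pre>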
<h2 id="other-tips">Other tips</h2>
<ul>
<li>accept4</li>
<li>char8</li>
<li>pessimistic read</li>
</ul>
<h2 id="profiling-and-benchmarking">Profiling and benchmarking</h2>
<p>Each item should be included in other chapters.</p>
<ul>
<li>weighttp (done)</li>
<li>GHC profiler</li>
<li>strace</li>
<li>eventlog</li>
<li>prof</li>
<li>tcpdump</li>
</ul>
<h2 id="conclusion">Conclusion</h2>