# Warp

Authors: Kazu Yamamoto, Michael Snoyman and Andreas Voellmy

Warp is a high-performance HTTP server library written in Haskell,
a purely functional programming language.
Both Yesod, a web application framework, and `mighty`, an HTTP server,
are implemented over Warp.
According to our throughput benchmark,
`mighty` provides performance on a par with `nginx`.
This article will explain
the architecture of Warp and how we achieved its performance.
Warp can run on many platforms,
including Linux, BSD variants, MacOS, and Windows.
To simplify our explanation, however, we will only talk about Linux
for the remainder of this article.

## Network programming in Haskell

Some people believe that
functional programming languages are slow or impractical.
However, to the best of our knowledge,
Haskell provides a nearly ideal approach for network programming.
This is because the Glasgow Haskell Compiler (GHC),
the flagship compiler for Haskell, provides
lightweight and robust user threads (sometimes called green threads).
In this section, we briefly review some well-known approaches to server-side network programming and compare them with network programming in Haskell.
We demonstrate that Haskell offers a combination of programmability and performance not available in other approaches: Haskell's convenient abstractions allow programmers to write clear, simple code, while GHC's sophisticated compiler and multicore runtime system produce multicore programs that execute in a way very similar to the most advanced hand-crafted network programs.

### Native threads

Traditional servers use a technique called thread programming.
In this architecture, each connection is handled
by a single process or native thread (sometimes called an OS thread).

This architecture can be further segmented based on the mechanism used for creating the processes or native threads.
When using a thread pool, multiple processes or native threads are created in advance.
An example of this is the prefork mode in Apache.
Otherwise, a process or native thread is spawned each time a connection is received. Fig (TBD:1.png) illustrates this.

![Native threads](https://raw.github.com/snoyberg/posa-chapter/master/1.png)

The advantage of this architecture is that it enables developers to
write clear code; in particular, the use of threads allows the code to
follow a simple and familiar flow of control and to use simple
procedure calls to fetch input or send output. Also, because the
kernel assigns processes or native threads to available cores, we can
balance utilization of cores. Its disadvantage is that a large number
of context switches between kernel and processes or native threads
occur, resulting in performance degradation.

### Event driven

In the world of high-performance servers,
the recent trend has been to take advantage of event-driven programming.
In this architecture multiple connections are handled
by a single process (Fig (TBD:2.png)).
Lighttpd is an example of a web server using this architecture.

![Event driven](https://raw.github.com/snoyberg/posa-chapter/master/2.png)

Since there is no need to switch processes,
fewer context switches occur, and performance is thereby improved.
This is its chief advantage.

On the other hand, this architecture substantially complicates the network program. In particular, this architecture inverts the flow of control so that the event loop controls the overall execution of the program. Programmers must therefore restructure their program into event handlers, each of which executes only non-blocking code. This restriction prevents programmers from performing IO using procedure calls; instead more complicated asynchronous methods must be used. Along the same lines, conventional exception handling methods are no longer applicable.

### One process per core

Many have hit upon the idea of creating
N event-driven processes to utilize N cores (Fig (TBD:3.png)).
Each process is called a *worker*.
A service port must be shared among workers.
Using the prefork technique, port sharing can be achieved.

In traditional process programming, a process for a new connection
is forked after the connection is accepted.
In contrast, the prefork technique forks processes
before new connections are accepted.
Despite the shared name,
this technique should not be confused with Apache's prefork mode.
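
To make the idea concrete, here is a minimal sketch of the prefork technique in Haskell (an illustration only, not `mighty`'s actual code; the names `preforkMain`, `worker`, and `handleConn` are ours):

    import Control.Concurrent (forkIO)
    import Control.Monad (forever, replicateM_, void)
    import Network.Socket
    import System.Posix.Process (forkProcess)

    -- Create and bind the listening socket once, then fork N worker
    -- processes; all of them accept connections from the shared socket.
    preforkMain :: Int -> IO ()
    preforkMain n = do
        sock <- socket AF_INET Stream defaultProtocol
        setSocketOption sock ReuseAddr 1
        bind sock (SockAddrInet 8080 0)        -- 0 stands for INADDR_ANY
        listen sock 1024
        replicateM_ n (void (forkProcess (worker sock)))

    worker :: Socket -> IO ()
    worker sock = forever $ do
        (conn, _peer) <- accept sock
        void (forkIO (handleConn conn))        -- one user thread per connection

    handleConn :: Socket -> IO ()
    handleConn conn = close conn               -- placeholder: serve HTTP here
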
7cafaa6 @kazu-yamamoto start writing the article.
kazu-yamamoto authored
82
849a081 @kazu-yamamoto modifying the draft according to reviewers' comments.
kazu-yamamoto authored
83 ![One process per core](https://raw.github.com/snoyberg/posa-chapter/master/3.png)
7cafaa6 @kazu-yamamoto start writing the article.
kazu-yamamoto authored
84
1698603 @kazu-yamamoto brush up.
kazu-yamamoto authored
85 One web server that uses this architecture is `nginx`.
4972182 @snoyberg A few updates
authored
86 Node.js used the event-driven architecture in the past, but
87 recently it also implemented the prefork technique.
7cafaa6 @kazu-yamamoto start writing the article.
kazu-yamamoto authored
88 The advantage of this architecture is
cf9018c @kazu-yamamoto brush up.
kazu-yamamoto authored
89 that it utilizes all cores and improves performance.
4972182 @snoyberg A few updates
authored
90 However, it does not resolve the issue of programs having poor clarity,
91 due to the reliance on handler / callback functions.
7cafaa6 @kazu-yamamoto start writing the article.
kazu-yamamoto authored
92
### User threads

GHC's user threads can be used to help solve the code clarity issue.
In particular, we can handle each HTTP connection in a new user thread.
This thread is programmed in a traditional style, using logically blocking I/O
calls. This keeps the program clear and simple, while GHC handles the complexities of
non-blocking IO and multicore work dispatching.

Under the hood, GHC multiplexes user threads
over a small number of native threads.
GHC's runtime system includes a multicore thread scheduler that can
switch between user threads cheaply, since it does so without involving any OS
context switches.

GHC's user threads are lightweight;
modern computers can run 100,000 user threads smoothly.
They are robust; even asynchronous exceptions are caught
(this feature is used by the timeout handler, described in Section (TBD:Warp's
architecture) and in Section (TBD:Timers for connections)).
In addition, the scheduler includes a multicore load balancing algorithm to help
utilize the capacity of all available cores.

When a user thread performs a logically blocking I/O operation, such as
receiving or sending data on a socket, a non-blocking
call is actually attempted. If it succeeds, the thread continues immediately without
involving the IO manager or the thread scheduler. If the call would block, the
thread instead registers interest for the relevant event with the runtime
system's IO manager component and then indicates to the scheduler that it is waiting. Independently, an IO
manager thread monitors events and notifies threads when their events occur, causing them to be
re-scheduled for execution. This all happens
transparently to the user thread, with no effort on the Haskell programmer's part.

In Haskell, most computation is non-destructive.
This means that almost all functions are thread-safe.
GHC uses data allocation as a safe point to switch the context of user threads.
Because of the functional programming style,
new data are frequently created and it is known that
[such data allocation occurs regularly enough for context switching](http://www.aosabook.org/en/ghc.html).

Though some languages provided user threads in the past,
they are not commonly used now because they are not lightweight
or are not robust.
Note that some languages provide library-level coroutines but
they are not preemptive threads.
Note also that Erlang and Go provide lightweight processes and lightweight goroutines, respectively.

As of this writing, `mighty` uses the prefork technique to fork processes
to utilize cores; Warp itself does not have this functionality.
Fig (TBD:4.png) illustrates this arrangement in the context of a web server with the prefork technique written in Haskell,
where each browser connection is handled by a single user thread, and a
single native thread in a process running on a CPU core handles work from several connections.

![User threads with one process per core](https://raw.github.com/snoyberg/posa-chapter/master/4.png)

We found that the IO manager component of the GHC runtime system
itself has performance bottlenecks.
To solve this problem, we have developed a `parallel IO manager` that uses
per-core event registration tables and event monitors to
greatly improve multicore scaling.
A Haskell program with the parallel IO manager is executed
as a single process and
multiple IO managers run as native threads to utilize cores (Fig (TBD:5.png)).
Each user thread is executed on any one of the cores.

![User threads in a single process](https://raw.github.com/snoyberg/posa-chapter/master/5.png)

GHC version 7.8, including the parallel IO manager, will be released
in the autumn of 2013.
With GHC version 7.8,
Warp itself will be able to use this architecture without any modifications
and `mighty` will not need to use the prefork technique.

## Warp's architecture

Warp is an HTTP engine for the Web Application Interface (WAI).
It runs WAI applications over HTTP.
As we described above, both Yesod and `mighty` are
examples of WAI applications, as illustrated in Fig (TBD:wai.png).

![Web Application Interface (WAI)](https://raw.github.com/snoyberg/posa-chapter/master/wai.png)

The type of WAI applications is as follows:

    type Application = Request -> ResourceT IO Response

In Haskell, the argument types of a function are separated by right arrows and
the rightmost one is the type of the return value.
So, we can interpret the definition
as: a WAI application takes a `Request` and returns a `Response`,
used in the context where I/O is possible and resources are well managed.
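
As a concrete illustration, here is a minimal WAI application written against the interface shown above (a simple sketch, not taken from Yesod or `mighty`):

    {-# LANGUAGE OverloadedStrings #-}
    import Network.Wai (Application, responseLBS)
    import Network.Wai.Handler.Warp (run)
    import Network.HTTP.Types (status200)

    -- Ignore the Request and always answer with a plain-text greeting.
    app :: Application
    app _request = return (responseLBS status200 [("Content-Type", "text/plain")] "Hello, world")

    -- Serve the application with Warp on port 3000.
    main :: IO ()
    main = run 3000 app
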

After accepting a new HTTP connection, a dedicated user thread is spawned for the
connection.
It first receives an HTTP request from a client
and parses it to `Request`.
Then, Warp gives the `Request` to a WAI application and
receives a `Response` from it.
Finally, Warp builds an HTTP response based on the `Response` value
and sends it back to the client.
This is illustrated in Fig (TBD:warp.png).

![The architecture of Warp](https://raw.github.com/snoyberg/posa-chapter/master/warp.png)

The user thread repeats this procedure
as necessary and terminates itself
when the connection is closed by the peer
or an invalid request is received.
It is also killed by a dedicated timeout-handling user thread
if a significant amount of data is not received for a certain period.

## Performance of Warp

Before we explain how to improve the performance of Warp,
we would like to show the results of our benchmark.
We measured the throughput of `mighty` version 2.8.4 (with Warp version 1.3.8.1) and `nginx` version 1.4.0.
Our benchmark environment is as follows:

- Two "12 cores" machines (Intel Xeon E5645, two sockets, 6 cores per CPU) connected with 1 Gbps Ethernet
- One directly runs Linux version 3.2.0 (Ubuntu 12.04 LTS)
- The other directly runs FreeBSD 9.1

We tested several benchmark tools in the past and
our favorite one was `httperf`.
Since it uses `select()` and is a single-process program,
it reaches its performance limits when we try to measure HTTP servers on
multiple cores.
So, we switched to `weighttp`, which
is based on `libev` (the `epoll` family) and can use
multiple native threads.
We used `weighttp` from FreeBSD as follows:

    weighttp -n 100000 -c 1000 -t 10 -k http://<ip_address>:<port_number>/

This means that 1,000 HTTP connections are established, with
each connection sending 100 requests.
Ten native threads are spawned to carry out these jobs.

The target web servers were compiled on Linux.
For all requests, the same `index.html` file is returned.
We used `nginx`'s `index.html`, whose size is 151 bytes.

Since Linux/FreeBSD have many control parameters,
we need to configure the parameters carefully.
You can find a good introduction to
Linux parameter tuning in [ApacheBench & HTTPerf](http://gwan.com/en_apachebench_httperf.html).
We carefully configured both `mighty` and `nginx` as follows:

- Enabled file descriptor cache
- Disabled logging
- Disabled rate limitation

Here is the result:

![Performance of Warp and `nginx`](https://raw.github.com/snoyberg/posa-chapter/master/benchmark.png)

The x-axis is the number of workers and the y-axis gives throughput,
measured in requests per second.

- mighty 2.8.4 (GHC 7.7): compiled with GHC version 7.7.20130504 (to be GHC version 7.8).
  It uses the parallel IO manager with only one worker.
  The GHC runtime options `+RTS -qa -A128m -N<x>` are specified, where
  `<x>` is the number of cores and `128m` is the allocation area size used by the garbage collector.

- mighty 2.8.4 (GHC 7.6.3): compiled with GHC version 7.6.3 (which is the latest stable version).

## Key ideas

There are four key ideas for implementing high-performance servers in Haskell:

1. Issuing as few system calls as possible
2. Specialized function implementations and avoiding re-calculation
3. Avoiding locks
4. Using proper data structures

### Issuing as few system calls as possible

Although system calls are typically inexpensive on most modern operating
systems, they can add a significant computational burden when called frequently.
Indeed, Warp serves each request with only a few system calls:
`recv()`, `send()` and `sendfile()` (a system call enabling zero-copy transfer of a file),
as illustrated in Fig (TBD:warp.png). Other system calls, such as `open()`, `stat()` and
`close()`, can often be omitted when processing a single request, thanks to the cache
mechanism described in Section (TBD:Timers for file descriptors).

We can use the `strace` command to see what system calls are actually used.
When we observed the behavior of `nginx` with `strace`,
we noticed that it used `accept4()`, which we did not know about at the time.

Using Haskell's standard network library,
a listening socket is created with the non-blocking flag set.
When a new connection is accepted from the listening socket,
it is necessary to set the corresponding socket as non-blocking as well.
The network library implements this by calling `fcntl()` twice:
once to get the current flags and once to set
the flags with the non-blocking flag *ORed* in.

On Linux, the non-blocking flag of a connected socket
is always unset even if its listening socket is non-blocking.
`accept4()` is an extended version of `accept()` on Linux.
It can set the non-blocking flag when accepting.
So, if we use `accept4()`, we can avoid two unnecessary `fcntl()`s.
Our patch to use `accept4()` on Linux has already been merged into
the network library.

### Specialized functions and avoiding re-calculation

GHC provides a profiling mechanism, but it has a limitation:
correct profiling is only possible
if a program runs in the foreground and does not spawn child processes.
So, if we want to profile the live activities of a server,
we need to take special care.

`mighty` takes such care.
Suppose that N is the number of workers
in the configuration file of `mighty`.
If N is greater than or equal to 2, `mighty` creates N child processes
and the parent process's only job is to deliver signals to them.
However, if N is 1, `mighty` does not create any child process.
Instead, the executed process itself serves HTTP.
Also, `mighty` stays in its terminal if debug mode is on.

When we profiled `mighty`,
we were surprised to find that the standard function for formatting date strings
consumed the majority of CPU time.
As many know, an HTTP server should return GMT date strings
in header fields such as `Date:` and `Last-Modified:`:

    Date: Mon, 01 Oct 2012 07:38:50 GMT

So, we implemented a special formatter to generate GMT date strings.
When we compared the standard function and our specialized function with
`criterion`, a standard benchmark library for Haskell,
ours was much faster.
But if an HTTP server accepts more than one request per second,
the server repeats the same formatting again and again.
So, we also implemented a cache mechanism for date strings.
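
The following is a simplified sketch of such a cache (an illustration with invented names, not `mighty`'s actual code, and assuming a recent `time` library where `defaultTimeLocale` lives in `Data.Time.Format`): one thread refreshes the formatted string roughly once per second, and request handlers only read an `IORef`.

    import Control.Applicative ((<$>))
    import Control.Concurrent (forkIO, threadDelay)
    import Control.Monad (forever, void)
    import Data.ByteString.Char8 (ByteString, pack)
    import Data.IORef (newIORef, readIORef, writeIORef)
    import Data.Time.Clock (getCurrentTime)
    import Data.Time.Format (defaultTimeLocale, formatTime)

    -- Returns an action that yields the cached GMT date string.
    newDateCache :: IO (IO ByteString)
    newDateCache = do
        ref <- newIORef =<< formatNow
        void . forkIO . forever $ do
            threadDelay 1000000                -- refresh once per second
            writeIORef ref =<< formatNow
        return (readIORef ref)
      where
        formatNow =
            pack . formatTime defaultTimeLocale "%a, %d %b %Y %H:%M:%S GMT"
                <$> getCurrentTime
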

We will also explain specialization and avoiding re-calculation in Section (TBD:Writing the Parser)
and in Section (TBD:Composer for HTTP response header).

### Avoiding locks

Unnecessary locks are evil for programming.
Our code sometimes uses unnecessary locks without our noticing
because, internally, the runtime system or libraries use locks.
To implement high-performance servers,
we need to identify such locks and
avoid them if possible.
It is worth pointing out that
locks will become much more critical under
the parallel IO manager.
We will talk about how to identify and avoid locks
in Section (TBD:Timers for connections) and
Section (TBD:Memory allocation).

### Using proper data structures

Haskell's standard data structure for strings is `String`,
which is a linked list of Unicode characters.
Since list programming is the heart of functional programming,
`String` is convenient for many purposes.
But for high-performance servers, the list structure is too slow
and Unicode is overly complex since the HTTP protocol is based on *byte* streams.
Instead, we use `ByteString` to express strings (or buffers).
A `ByteString` is an array of bytes with metadata.
Thanks to this metadata,
splicing without copying is possible.
This is described in Section (TBD:Writing the Parser) in detail.

Other examples of proper data structures are
`Builder` and double `IORef`.
They are explained in Section (TBD:Composer for HTTP response header)
and Section (TBD:Timers for connections), respectively.

## HTTP request parser

Besides the many issues involved with efficient concurrency and I/O in a multicore environment,
Warp also needs to be certain that each core is performing its tasks efficiently.
In that regard, the most relevant component is the HTTP request processing.
The purpose is to take a stream of bytes coming from the incoming socket,
parse out the request line and individual headers,
and leave the request body to be processed by the application.
It must take this information and produce a data structure which the application
(whether a Yesod application, `mighty`, or something else)
will use to form its response.

The request body itself presents some interesting challenges.
Warp provides full support for pipelining and chunked request bodies.
As a result, Warp must "dechunk" any chunked request bodies before passing them to the application.
With pipelining, multiple requests can be transferred on a single connection.
Therefore, Warp must ensure that the application does not consume too many bytes,
as that would remove vital information from the next request.
It must also be sure to discard any data remaining from the request body;
otherwise, the remainder will be parsed as the beginning of the next request,
causing either an invalid request or a misunderstood request.

As an example, consider the following theoretical request from a client:

    POST /some/path HTTP/1.1
    Transfer-Encoding: chunked
    Content-Type: application/x-www-form-urlencoded

    0008
    message=
    000a
    helloworld
    0000

    GET / HTTP/1.1

The HTTP parser must extract the `/some/path` pathname and the `Content-Type` header and pass these to the application.
When the application begins reading the request body, it must strip off the chunk headers (e.g., `0008` and `000a`)
and instead provide the actual content, i.e. `message=helloworld`.
It must also ensure that no more bytes are consumed after the chunk terminator (`0000`)
so as not to interfere with the next pipelined request.

### Writing the Parser

Haskell is known for its powerful parsing capabilities.
It has traditional parser generators as well as combinator libraries, such as Parsec and Attoparsec.
Parsec and Attoparsec's textual modules work in a fully Unicode-aware manner.
However, HTTP headers are guaranteed to be ASCII,
so Unicode awareness is an overhead we need not incur.

Attoparsec also provides a binary interface for parsing,
which would let us bypass the Unicode overhead.
But as efficient as Attoparsec is,
it still introduces an overhead relative to a hand-rolled parser.
So for Warp, we have not used any parser libraries.
Instead, we perform all parsing manually.

This gives rise to another question: how do we represent the actual binary data?
The answer is a `ByteString`, which is
essentially three pieces of data:
a pointer to some piece of memory,
the offset from the beginning of that memory to the data in question,
and the size of our data.

The offset information may seem redundant.
We could instead insist that our memory pointer point to the beginning of our data.
However, by including the offset, we enable data sharing.
Multiple `ByteString`s can all point to the same chunk of memory and use different parts of it (a.k.a., splicing).
There is no concern of data corruption, since `ByteString`s, like most Haskell data, are immutable.
When the final pointer to a piece of memory is no longer used, the memory buffer is deallocated.

This combination is perfect for our use case.
When a client sends a request over a socket,
Warp will read the data in relatively large chunks (currently 4096 bytes).
In most cases, this is large enough to encompass the entire request line
and all request headers.
Warp will then use its hand-rolled parser to break this large chunk into lines.
This can be done quite efficiently since:

1. We need only scan the memory buffer for newline characters.
The bytestring library provides such helper functions,
which are implemented with lower-level C functions like `memchr`.
(It's actually a little more complicated than that due to multiline headers,
but the same basic approach still applies.)
A sketch of this line splitting appears after this list.

2. There is no need to allocate extra memory buffers to hold the data.
We just take splices from the original buffer.
See figure (TBD:bytestring.png) for a demonstration of splicing individual components from a larger chunk of data.
It's worth stressing this point:
we actually end up with a situation which is more efficient than idiomatic C.
In C, strings are null-terminated, so splicing requires
allocating a new memory buffer,
copying the data from the old buffer,
and appending the null character.
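
The following is a naive sketch of that line splitting (an illustration, not Warp's actual parser): `elemIndex` is implemented on top of `memchr`, and every line returned is a splice of the original chunk, so no bytes are copied.

    import qualified Data.ByteString.Char8 as B

    -- Split a chunk into lines at '\n', dropping a trailing '\r' from each line.
    intoLines :: B.ByteString -> [B.ByteString]
    intoLines bs
        | B.null bs = []
        | otherwise =
            case B.elemIndex '\n' bs of
                Nothing -> [bs]
                Just i  -> let (line, rest) = B.splitAt i bs
                           in  chop line : intoLines (B.drop 1 rest)
      where
        chop line
            | not (B.null line) && B.last line == '\r' = B.init line
            | otherwise                                = line
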

![Splicing ByteStrings](https://raw.github.com/snoyberg/posa-chapter/master/bytestring.png)

Once the buffer has been broken into lines,
we perform a similar maneuver to turn the header lines into key/value pairs.
For the request line, we parse the requested path fairly deeply.
Suppose we have a request for:

    GET /buenos/d%C3%ADas HTTP/1.1

We need to perform the following steps (a sketch follows the list):

1. Separate the request method, path, and version into individual pieces.

2. Tokenize the path along forward slashes, ending up with `["buenos", "d%C3%ADas"]`.

3. Percent-decode the individual pieces, ending up with `["buenos", "d\195\173as"]`.

4. UTF8-decode each piece, finally arriving at Unicode-aware text: `["buenos", "días"]`.
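
Here is a naive sketch of steps 2 through 4 (for illustration only; Warp's real code uses a lookup table for the hex digits and avoids the unpack/pack round trip shown here):

    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Char8 as B
    import Data.Text (Text)
    import Data.Text.Encoding (decodeUtf8)
    import Data.Word (Word8)

    -- Tokenize on '/', percent-decode each piece, then UTF8-decode it.
    decodePath :: B.ByteString -> [Text]
    decodePath = map (decodeUtf8 . percentDecode) . filter (not . B.null) . B.split '/'

    percentDecode :: BS.ByteString -> BS.ByteString
    percentDecode = BS.pack . go . BS.unpack
      where
        go (37 : hi : lo : rest) = hexVal hi * 16 + hexVal lo : go rest  -- 37 is '%'
        go (w : rest)            = w : go rest
        go []                    = []
        hexVal :: Word8 -> Word8
        hexVal w
            | w >= 48 && w <= 57  = w - 48   -- '0'..'9'
            | w >= 65 && w <= 70  = w - 55   -- 'A'..'F'
            | w >= 97 && w <= 102 = w - 87   -- 'a'..'f'
            | otherwise           = 0
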

There are a few performance gains in this process:

1. As with newline checking, finding forward slashes is a very efficient operation.

2. We use an efficient lookup table for turning hex characters into numerical values.
This code is a single memory lookup and involves no branching.

3. UTF8-decoding is a highly optimized operation in the text package.
Likewise, the text package represents this data in an efficient, packed representation.

4. Due to Haskell's laziness, this calculation will be performed on demand.
If the application in question does not need the textual version of the path,
none of these steps will be performed.

The final piece of parsing we perform is dechunking.
In many ways, dechunking is a simpler form of parsing.
We parse a single hexadecimal number, and then read the stated number of bytes.
Those bytes are passed on verbatim, without any buffer copying, to the application.

### Conduit

This article has mentioned a few times the concept of passing the request body to the application.
It has also hinted at the issue of the application passing a response back to the server,
and the server receiving data from and sending data to the socket.
A final related point not yet discussed is *middleware*:
components sitting between the server and the application that somehow modify the request and/or response.
The definition of a middleware is:

    type Middleware = Application -> Application

The intuition behind this is that a middleware will take some "internal" application,
preprocess the request,
pass it to the internal application to get a response,
and then postprocess the response.
For our purposes, a prime example would be a gzip middleware,
which automatically compresses response bodies.
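
For example, a trivial middleware might rewrite the request before handing it to the inner application. The following is a sketch only; the record field `pathInfo` is the WAI 1.x accessor, and the gzip middleware just mentioned is assumed to come from the separate wai-extra package:

    {-# LANGUAGE OverloadedStrings #-}
    import Network.Wai (Middleware, pathInfo)

    -- Treat requests for "/" as requests for "/index.html",
    -- then delegate to the wrapped application.
    indexMiddleware :: Middleware
    indexMiddleware app request
        | null (pathInfo request) = app request { pathInfo = ["index.html"] }
        | otherwise               = app request

    -- Middlewares compose by plain function application, e.g. with wai-extra's gzip:
    --   run 3000 (gzip def (indexMiddleware app))
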

A prerequisite for the creation of such middlewares
is a means of modifying both incoming and outgoing data streams.
A standard approach historically in the Haskell world has been *lazy I/O*.
With lazy I/O, we represent a stream of values as a single, pure data structure.
As more data is requested from this structure, I/O actions will be performed to grab the data from its source.
Lazy I/O provides a huge level of composability.
However, for a high-throughput server, it presents a major obstacle:
resource finalization in lazy I/O is non-deterministic.
Using lazy I/O, it would be very easy for a server under high load
to quickly run out of file descriptors.

It would also be possible to use a lower-level abstraction,
essentially dealing directly with read and write functions.
However, one of the advantages of Haskell is its high-level approach,
allowing us to reason about the behavior of our code.
It's also not obvious how such a solution would deal
with some of the common issues which arise when creating web applications.
For example, it's often necessary to have a buffering solution,
where we read a certain amount of data at one step (e.g., the request header processing),
and read the remainder in a separate part of the codebase (e.g., the web application).

To address this dilemma, the WAI protocol, and therefore Warp, is built on top of the conduit package.
This package provides an abstraction for streams of data.
It keeps much of the composability of lazy I/O,
provides a buffering solution,
and guarantees deterministic resource handling.
Exceptions are also kept where they belong,
in the parts of your code which deal with I/O,
instead of hiding them in a data structure claiming to be pure.

Warp represents the incoming stream of bytes from the client as a `Source`,
and writes data to be sent to the client to a `Sink`.
The `Application` is provided a `Source` with the request body,
and provides a response as a `Source` as well.
Middlewares are able to intercept the `Source`s for the request and response bodies
and apply transformations to them.
Figure (TBD:middleware.png) demonstrates how a middleware fits between Warp and an application.
The composability of the conduit package makes this an easy and efficient operation.

![Middlewares](https://raw.github.com/snoyberg/posa-chapter/master/middleware.png)

Elaborating on the gzip middleware example,
conduit allows us to create a middleware which runs in a nearly optimal manner.
The original `Source` provided by the application is connected to the `gzip` `Conduit`.
As each new chunk of data is produced by the initial `Source`,
it is fed into the `zlib` library, filling up a buffer with compressed bytes.
When that buffer is filled, it is emitted,
either to another middleware,
or to Warp.
Warp then takes this compressed buffer and sends it over the socket to the client.
At this point, the buffer can either be reused, or its memory freed.
In this way, we have optimal memory usage,
do not produce any extra data in the case of network failure,
and lessen the garbage collection burden for the runtime.

Conduit itself is a large topic, and therefore will not be covered in more depth.
Suffice it to say for now that conduit's usage in Warp is a contributing factor to its high performance.

### Slowloris protection

We have one final concern: the [slowloris attack](http://en.wikipedia.org/wiki/Slowloris).
This is a form of Denial of Service (DoS) attack wherein each client sends very small amounts of information.
By doing so, the client is able to maintain a higher number of connections on the same hardware/bandwidth.
Since the web server has a constant overhead for each open connection regardless of bytes being transferred,
this can be an effective attack.
Therefore, Warp must detect when a connection is not sending enough data over the network and kill it.

We discuss the timeout manager, which is the true heart of slowloris protection, in more detail below.
When it comes to request processing,
our only requirement is to tease the timeout handler to let it know more data has been received from the client.
In Warp, this is all done at the conduit level.
As mentioned, the incoming data is represented as a `Source`. As part of that `Source`,
every time a new chunk of data is received, the timeout handler is teased.
Since teasing the handler is such a cheap operation (essentially just a memory write),
slowloris protection does not hinder the performance of individual connection handlers in a significant way.

## HTTP response composer

This section describes the HTTP response composer of Warp.
A WAI `Response` has three constructors:

    ResponseFile Status ResponseHeaders FilePath (Maybe FilePart)
    ResponseBuilder Status ResponseHeaders Builder
    ResponseSource Status ResponseHeaders (Source (ResourceT IO) (Flush Builder))

`ResponseFile` is used to send a static file, while
`ResponseBuilder` and `ResponseSource` are for sending
dynamic content created in memory.
Each constructor includes both `Status` and `ResponseHeaders`.
`ResponseHeaders` is defined as a list of key/value header pairs.
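
For instance, a static file can be served with the `responseFile` helper, which wraps the `ResponseFile` constructor (a small hedged example, not Warp code):

    {-# LANGUAGE OverloadedStrings #-}
    import Network.Wai (Response, responseFile)
    import Network.HTTP.Types (status200)

    -- Ask Warp to send "index.html"; Nothing means "the whole file".
    indexResponse :: Response
    indexResponse =
        responseFile status200 [("Content-Type", "text/html")] "index.html" Nothing
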

### Composer for HTTP response header

The old composer built the HTTP response header with a `Builder`,
a rope-like data structure.
First, it converted `Status` and each element of `ResponseHeaders`
into a `Builder`. Each conversion runs in O(1).
Then, it concatenated them by repeatedly appending one `Builder` to another.
Thanks to the properties of `Builder`, each append operation also runs in O(1).
Lastly, it packed an HTTP response header
by copying data from the `Builder` to a buffer in O(N).

In many cases, the performance of `Builder` is sufficient.
But we found that it is not fast enough for
high-performance servers.
To eliminate the overhead of `Builder`,
we implemented a special composer for HTTP response headers
by directly using `memcpy()`, a highly tuned byte copy function in C.
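
The core of such a composer is a copy step along the following lines (a simplified sketch with invented names; `copyBytes` is the Haskell binding that wraps `memcpy()`):

    import Data.ByteString.Internal (ByteString(..))
    import Data.Word (Word8)
    import Foreign.ForeignPtr (withForeignPtr)
    import Foreign.Marshal.Utils (copyBytes)
    import Foreign.Ptr (Ptr, plusPtr)

    -- Copy one header fragment into the destination buffer and return the
    -- pointer just past the copied bytes, ready for the next fragment.
    copyFragment :: Ptr Word8 -> ByteString -> IO (Ptr Word8)
    copyFragment dst (PS fptr off len) =
        withForeignPtr fptr $ \src -> do
            copyBytes dst (src `plusPtr` off) len
            return (dst `plusPtr` len)
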

### Composer for HTTP response body

For `ResponseBuilder` and `ResponseSource`,
the `Builder` values provided by the application are packed into a list of `ByteString`s.
A composed header is prepended to the list and
`send()` is used to send the list in a fixed buffer.

For `ResponseFile`,
Warp uses `send()` and `sendfile()` to send
an HTTP response header and body, respectively.
Fig (TBD:warp.png) illustrates this case.
Again, `open()`, `stat()`, `close()` and other system calls can be omitted
thanks to the cache mechanism described in Section (TBD:Timers for file descriptors).
The following subsection describes another performance tuning
in the case of `ResponseFile`.

### Sending header and body together

When we measured the performance of Warp when sending static files,
we always did it with high concurrency.
That is, we always made multiple connections at the same time.
It gave us a good result.
However, when we set the concurrency level to 1,
we found Warp to be really slow.

Observing the results of the `tcpdump` command,
we realized that this is because originally Warp used
the combination of `writev()` for the header and `sendfile()` for the body.
In this case, the HTTP header and body are sent in separate TCP packets (Fig (TBD:tcpdump.png)).

![Packet sequence of old Warp](https://raw.github.com/snoyberg/posa-chapter/master/tcpdump.png)

To send them in a single TCP packet (when possible),
the new version of Warp switched from `writev()` to `send()`.
It uses `send()` with the `MSG_MORE` flag to store the header
and `sendfile()` to send both the stored header and a file.
This made the throughput at least 100 times faster
according to our throughput benchmark.

## Clean-up with timers

This section explains how to implement connection timeouts and
how to cache file descriptors.

### Timers for connections

To prevent slowloris attacks,
communication with a client should be canceled
if the client does not send a significant amount of data
for a certain period.
Haskell provides a standard function called `timeout`
whose type is as follows:

    Int -> IO a -> IO (Maybe a)

The first argument is the duration of the timeout, in microseconds.
The second argument is an action which handles input/output (IO).
This function returns a value of `Maybe a` in the IO context.
`Maybe` is defined as follows:

    data Maybe a = Nothing | Just a

`Nothing` indicates an error (without reason information) and
`Just` encloses a successful value `a`.
So, `timeout` returns `Nothing`
if an action is not completed within the specified time.
Otherwise, a successful value is returned wrapped with `Just`.
`timeout` eloquently shows how great Haskell's composability is.
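
For example, the following gives an I/O action one second to finish (a generic illustration of `timeout`, unrelated to Warp's code):

    import System.Timeout (timeout)

    main :: IO ()
    main = do
        result <- timeout 1000000 getLine    -- wait at most one second for a line
        case result of
            Nothing   -> putStrLn "timed out"
            Just line -> putStrLn ("got: " ++ line)
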

`timeout` is useful for most purposes,
but its performance is inadequate for implementing high-performance servers.
The problem is that, for each timeout created, this function will spawn a new user thread.
While user threads are cheaper than native threads,
they still involve an overhead which can add up.
We need to avoid the creation of a user thread for each connection's timeout handling.
So, we implemented a timeout system which uses only
one user thread, called the timeout manager, to handle the timeouts of all connections.
At its core are the following two points:

- Double `IORef`s
- Safe swap and merge algorithm

Suppose that the status of each connection is described as `Active` or `Inactive`.
To clean up inactive connections,
the timeout manager repeatedly inspects the status of each connection.
If the status is `Active`, the timeout manager turns it to `Inactive`.
If it is `Inactive`, the timeout manager kills its associated user thread.

Each status is referred to by an `IORef`.
`IORef` is a reference whose value can be destructively updated.
In addition to the timeout manager,
each user thread repeatedly turns its status to `Active` through its own `IORef` as long as its connection remains active.

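A minimal sketch of this scheme follows; the names `State`, `tickle`, and
`sweep` are ours, and Warp's actual code is more involved:

    import Data.IORef (IORef, readIORef, writeIORef)

    data State = Active | Inactive

    -- Called by a user thread whenever its connection makes progress.
    tickle :: IORef State -> IO ()
    tickle ref = writeIORef ref Active

    -- One step of the timeout manager for a single connection:
    -- demote an active connection, reap an inactive one.
    sweep :: IORef State -> IO () -> IO ()
    sweep ref kill = do
        state <- readIORef ref
        case state of
            Active   -> writeIORef ref Inactive
            Inactive -> kill
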
The timeout manager uses a list of these status `IORef`s.
A user thread spawned for a new connection
tries to prepend a new `IORef` holding an `Active` status to the list.
So, the list is a critical section and we need atomicity to keep
the list consistent.

![A list of status values. `A` and `I` indicate `Active` and `Inactive`, respectively](https://raw.github.com/snoyberg/posa-chapter/master/timeout.png)

A standard way to keep consistency in Haskell is `MVar`.
But `MVar` is slow,
since each `MVar` is protected with a home-brewed lock.
Instead, we used another `IORef` to refer to the list and `atomicModifyIORef`
to manipulate it.
`atomicModifyIORef` is a function to atomically update an `IORef`'s value.
It is fast since it is implemented via CAS (compare-and-swap),
which is much faster than locks.

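Continuing the sketch above, a user thread spawned for a new connection could
register itself along these lines (again, the names are illustrative):

    import Data.IORef (IORef, atomicModifyIORef)

    -- Atomically prepend this thread's status IORef to the shared list.
    register :: IORef [IORef State] -> IORef State -> IO ()
    register listRef stRef =
        atomicModifyIORef listRef (\refs -> (stRef : refs, ()))
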
The following is the outline of the safe swap and merge algorithm:

    do xs <- atomicModifyIORef ref (\ys -> ([], ys)) -- swap with an empty list, []
       xs' <- manipulates_status xs                  -- turn Active into Inactive, drop the dead
       atomicModifyIORef ref (\ys -> (merge xs' ys, ())) -- merge with newly registered statuses

The timeout manager atomically swaps the list with an empty list.
Then it manipulates the list by turning statuses and/or removing
unnecessary statuses for killed user threads.
During this process, new connections may be created and
their statuses are inserted via `atomicModifyIORef` by
their corresponding user threads.
Then, the timeout manager atomically merges
the pruned list and the new list.
Thanks to the lazy evaluation of Haskell,
the application of the merge function is done in O(1) and
the merge operation, which is O(N), is postponed
until its values are actually consumed.

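To see why the application is O(1), note that a list merge in the spirit of
`(++)` produces its result lazily; this illustrative definition (not
necessarily Warp's exact one) only does work as the list is consumed:

    merge :: [a] -> [a] -> [a]
    merge []     ys = ys
    merge (x:xs) ys = x : merge xs ys  -- the recursive call stays a thunk until demanded
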
### Timers for file descriptors

Let's consider the case where Warp sends an entire file with `sendfile()`.
Unfortunately, we need to call `stat()`
to obtain the size of the file,
because `sendfile()` on Linux requires the caller
to specify how many bytes are to be sent
(`sendfile()` on FreeBSD/MacOS has a magic number *0*
which indicates the end of file).

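In Haskell, obtaining that size corresponds to something like the following
sketch using the `unix` package (Warp's actual code differs):

    import System.Posix.Files (fileSize, getFileStatus)
    import System.Posix.Types (FileOffset)

    -- The byte count that sendfile() on Linux needs; each call costs a stat().
    fileLength :: FilePath -> IO FileOffset
    fileLength path = fmap fileSize (getFileStatus path)
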
If WAI applications know the file size,
Warp can avoid `stat()`.
It is easy for WAI applications to cache file information
such as size and modification time.
If the cache timeout is short enough (say 10 seconds),
the risk of cache inconsistency is not serious.
And because we can safely clean up the cache,
we don't have to worry about leakage.

Since `sendfile()` requires a file descriptor,
the naive sequence to send a file is
`open()`, `sendfile()` repeatedly if necessary, and `close()`.
In this subsection, we consider how to cache file descriptors
to avoid `open()` and `close()`.
Caching file descriptors should work as follows:
if a client requests that a file be sent,
a file descriptor is opened by `open()`.
And if another client requests the same file shortly thereafter,
the previously opened file descriptor is reused.
At a later time, the file descriptor is closed by `close()`
if no user thread uses it.

A typical tactic for this case is reference counting.
But we were not sure that we could implement a robust
reference-counting mechanism.
What happens if a user thread is killed for unexpected reasons?
If we fail to decrement its reference counter,
the file descriptor leaks.
We noticed that the connection timeout scheme is safe
to reuse as a cache mechanism for file descriptors
because it does not use reference counters.
However, we cannot simply reuse the timeout manager for several reasons.

Each user thread has its own status, so status is not shared.
But we would like to cache file descriptors to avoid `open()` and
`close()` by sharing them among user threads.
So, we need to search the cached file descriptors
for one matching the requested file.
Since this look-up should be fast, we should not use a list.
Also,
because requests are received concurrently,
two or more file descriptors for the same file may be opened.
So, we need to store multiple file descriptors for a single file name.
Such a structure is technically called a *multimap*.

We implemented a multimap whose look-up is O(log N) and
whose pruning is O(N), using red-black trees
whose nodes contain non-empty lists.
Since a red-black tree is a binary search tree,
look-up is O(log N), where N is the number of nodes.
We can also convert it into an ordered list in O(N).
In our implementation,
pruning the nodes which contain a file descriptor to be closed is
also done during this step.
We adopted an algorithm that converts an ordered list back into a red-black tree in O(N).

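The shape of such a structure can be sketched as follows; the real
implementation also carries the bookkeeping needed for balancing and for
pruning stale entries:

    data Color = Red | Black

    -- Each node holds a key (the file path) and a non-empty list of
    -- cached entries, one per open descriptor for that path.
    data MMap v = Leaf
                | Node Color (MMap v) FilePath [v] (MMap v)

    look :: FilePath -> MMap v -> [v]
    look _ Leaf = []
    look key (Node _ left k vs right)
      | key < k   = look key left
      | key > k   = look key right
      | otherwise = vs
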
## Future work

We have several ideas for improving Warp in the future, but
we will explain two of them here.

### Memory allocation

When receiving and sending packets, buffers are allocated. These buffers are
allocated as so-called "pinned" byte arrays, so that they can be passed to C
procedures such as `recv()` and `send()`. Since it is best to receive or send as
much data as possible in each system call, these buffers are moderately sized.
Unfortunately, GHC's method for allocating large (larger than 409 bytes on
64-bit machines) pinned byte arrays takes a global lock in the runtime system.
This lock may become a bottleneck when scaling beyond 16 cores, if user threads
frequently allocate such buffers.

We performed an initial investigation of the performance impact of large pinned
array allocation for HTTP response header generation. For this purpose, GHC
provides an *eventlog* mechanism which can record a timestamp for each event. We
surrounded a memory allocation function with calls to the function that records
a user event. Then we compiled `mighty` with it and took an eventlog. The
resulting eventlog is illustrated below:

![eventlog](https://raw.github.com/snoyberg/posa-chapter/master/eventlog.png)

Brick-red bars indicate the events created by us.
So, the area surrounded by two such bars is the time consumed by memory allocation.
It is about 1/10 of an HTTP session.
We are discussing how to implement memory allocation without locks.

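For reference, such a user-event pair can be emitted with a helper along these
lines (the helper and labels are ours; the program must be built with
`-eventlog` and run with `+RTS -l` to produce the log):

    import Debug.Trace (traceEventIO)

    -- Bracket an action with two user events so that the time spent inside
    -- shows up between two marks in the eventlog.
    withUserEvent :: String -> IO a -> IO a
    withUserEvent label act = do
        traceEventIO ("START " ++ label)
        result <- act
        traceEventIO ("STOP " ++ label)
        return result
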
### New thundering herd

The thundering herd is an old but new problem.
Suppose that processes or native threads are pre-forked to share a listening socket.
They call `accept()` on the socket.
When a connection is created, older Linux and FreeBSD kernels
wake up all of them.
Only one can accept it, and the others go back to sleep.
Since this causes many context switches,
we face a performance problem.
This is called the *thundering herd*.
Recent Linux and FreeBSD kernels wake up only one process or native thread,
so this problem has become a thing of the past.

Recent network servers tend to use the `epoll` family.
If workers share a listening socket and
manage accepted connections through the `epoll` family,
the thundering herd appears again.
This is because
the semantics of the `epoll` family is to notify
all processes or native threads.
`nginx` and `mighty` are victims of this new thundering herd.

The parallel IO manager is free from the new thundering herd problem.
In this architecture,
only one IO manager accepts new connections through the `epoll` family,
and the other IO managers handle established connections.

## Conclusion

Warp is a versatile web server library,
providing efficient HTTP communication for a wide range of use cases.
In order to achieve its high performance,
optimizations have been performed at many levels,
including network communications, thread management, and request parsing.

Haskell has proven to be an amazing language for writing such a codebase.
Features like immutability by default make it easier to write thread-safe code
and avoid extra buffer copying.
The multi-threaded runtime drastically simplifies the process of writing event-driven code.
And GHC's powerful optimizations mean that in many cases,
we can write high-level code and still reap the benefits of high performance.
Yet with all of this performance, our codebase is still relatively tiny
(under 1300 SLOC at the time of writing).
If you are looking to write maintainable, efficient, concurrent code,
Haskell should be a strong consideration.