
Commit

brushing up timeout sections.
kazu-yamamoto committed Nov 2, 2012
1 parent b11e16d commit 390e04a
Showing 2 changed files with 106 additions and 80 deletions.
39 changes: 16 additions & 23 deletions warp.html
@@ -30,7 +30,7 @@ <h3 id="user-threads">User threads</h3>
<div class="figure">
<img src="4.png" alt="User threads" /><p class="caption">User threads</p>
</div>
<p>As of this writing, <code>mighty</code> uses the prefork technique to fork processes to utilize cores, while Warp does not have this functionality. The Haskell community is now developing a parallel IO manager. If it is merged into GHC, Warp itself will be able to use this architecture without any modifications.</p>
<p>As of this writing, <code>mighty</code> uses the prefork technique to fork processes to utilize cores, while Warp does not have this functionality. The Haskell community is now developing a parallel IO manager. A Haskell program with the parallel IO manager is executed as a single process, and multiple IO managers run as native threads to utilize cores. User threads are executed on one of those cores. If it is merged into GHC, Warp itself will be able to use this architecture without any modifications.</p>
<h2 id="warps-architecture">Warp's architecture</h2>
<p>Warp is an HTTP engine for WAI (Web Application Interface). It runs WAI applications over HTTP. As we described before, both Yesod and <code>mighty</code> are examples of WAI applications, as illustrated in Fig XXX.</p>
<div class="figure">
@@ -43,14 +43,14 @@ <h2 id="warps-architecture">Warp's architecture</h2>
<div class="figure">
<img src="warp.png" alt="Warp" /><p class="caption">Warp</p>
</div>
<p>The user thread repeats this procedure if necessary and terminates by itself when the connection is closed by the peer.</p>
<p>The user thread repeats this procedure if necessary and terminates by itself when the connection is closed by the peer. It is also killed by a dedicated user thread for timeouts if a significant amount of data is not received within a specified period.</p>
<h2 id="performance-of-warp">Performance of Warp</h2>
<p>Before we explain how to improve the performance of Warp, we would like to show the results of our benchmark. We measured the throughput of <code>mighty</code> 2.8.2 (with Warp x.x.x) and <code>nginx</code> 1.2.4. Our benchmark environment is as follows:</p>
<ul>
<li>One &quot;12 cores&quot; machine (Intel Xeon E5645, two sockets, 6 cores per CPU, two QPI links between the CPUs)</li>
<li>Linux version 3.2.0 (Ubuntu 12.04 LTS), which is running directly on the machine (i.e. without a hypervisor)</li>
</ul>
<p>We tested several benchmark tools in the past and our favorite one was <code>httperf</code>. Since it uses the <code>select()</code> system call and is just a single-process program, it reaches its performance limits when we try to measure HTTP servers on multi-core machines. So, we switched to <code>weighttp</code>, which is based on the <code>epoll</code> system call family and can use multiple native threads. We used <code>weighttp</code> as follows:</p>
<p>We tested several benchmark tools in the past and our favorite one was <code>httperf</code>. Since it uses <code>select()</code> and is just a single-process program, it reaches its performance limits when we try to measure HTTP servers on multi-core machines. So, we switched to <code>weighttp</code>, which is based on the <code>epoll</code> family and can use multiple native threads. We used <code>weighttp</code> as follows:</p>
<pre><code>weighttp -n 100000 -c 1000 -t 3 -k http://127.0.0.1:8000/</code></pre>
<p>This means that 1,000 HTTP connections are established and each connection sends 100 requests. Three native threads are spawned to carry out these jobs.</p>
<p>For all requests, the same <code>index.html</code> file is returned. We used <code>nginx</code>'s <code>index.html</code>, whose size is 151 bytes. As &quot;127.0.0.1&quot; suggests, we measured the web servers locally. We should have measured from a remote machine, but we do not have a suitable environment at this moment. (NOTE: I'm planning to do a benchmark using two machines soon.)</p>
@@ -74,17 +74,14 @@ <h2 id="key-ideas">Key ideas</h2>
<li>Avoiding locks</li>
</ol>
<h3 id="issuing-as-few-system-calls-as-possible">Issuing as few system calls as possible</h3>
<p>If a system call is issued, CPU time is given to the kernel and all user threads stop. So, we need to use as few system calls as possible. For an HTTP session to get a static file, Warp calls <code>recv()</code>, <code>send()</code> and <code>sendfile()</code> only (Fig warp.png). <code>open()</code>, <code>stat()</code>, <code>close()</code> and other system calls can be omitted thanks to the cache mechanism described later.</p>
<p>We can use <code>strace</code> to see what system calls are actually used. When we observed the behavior of <code>nginx</code> with <code>strace</code>, we noticed that it used <code>accept4()</code>, which we did not know about at that time.</p>
<p>Using Haskell's standard network library, a listening socket is created with the non-blocking flag set. When a new connection is accepted from the listening socket, it is necessary to set the corresponding socket as non-blocking, too. The <code>network</code> package implements this by calling <code>fcntl()</code> twice: once to get the current flags and once to set the flags with the non-blocking flag <em>ORed</em> in.</p>
<p>On Linux, the non-blocking flag of a connected socket is always unset even if its listening socket is non-blocking. The <code>accept4()</code> system call is an extended version of <code>accept()</code> on Linux. It can set the non-blocking flag when accepting. So, if we use <code>accept4()</code>, we can avoid two unnecessary <code>fcntl()</code>s. Our patch to use <code>accept4()</code> on Linux has already been merged into the network library.</p>
<p>On Linux, the non-blocking flag of a connected socket is always unset even if its listening socket is non-blocking. <code>accept4()</code> is an extended version of <code>accept()</code> on Linux. It can set the non-blocking flag when accepting. So, if we use <code>accept4()</code>, we can avoid two unnecessary <code>fcntl()</code>s. Our patch to use <code>accept4()</code> on Linux has already been merged into the network library.</p>
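<p>To illustrate the difference, here is a minimal sketch, not the <code>network</code> package's actual code, of the extra work needed when <code>accept4()</code> is unavailable. It assumes the pre-3.0 <code>network</code> API in which <code>fdSocket</code> is a pure function, and it uses <code>setFdOption</code> from the <code>unix</code> package, which performs the get-then-set <code>fcntl()</code> sequence internally; the real library does the equivalent in its accept path.</p>
<pre><code>import Network.Socket (Socket, SockAddr, accept, fdSocket)
import System.Posix.IO (setFdOption, FdOption (NonBlockingRead))
import System.Posix.Types (Fd (Fd))

-- Accept a connection and mark it non-blocking by hand.
acceptNonBlocking :: Socket -&gt; IO (Socket, SockAddr)
acceptNonBlocking listenSock = do
    (conn, peer) &lt;- accept listenSock
    -- setFdOption issues fcntl(F_GETFL) and then fcntl(F_SETFL) with
    -- O_NONBLOCK ORed in; accept4() makes both calls unnecessary.
    setFdOption (Fd (fdSocket conn)) NonBlockingRead True
    return (conn, peer)</code></pre>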
<h3 id="specialization-and-avoiding-re-calculation">Specialization and avoiding re-calculation</h3>
<p>GHC profiling criterion Char8 http-date</p>
<h3 id="avoiding-locks">Avoiding locks</h3>
<ul>
<li>Talking about parallel IO manager</li>
</ul>
<p>TBD</p>
<p>Unnecessary locks are evil for programming. Our code sometimes uses unnecessary locks imperceptibly because runtime systems or libraries use locks deep inside. To implement a high-performance server, we need to identify such locks and avoid them if possible. It is worth pointing out that locks will become much more critical under the parallel IO manager. We will explain how to identify and avoid locks in Section XXX and Section XXX.</p>
<h2 id="http-request-parser">HTTP request parser</h2>
<ul>
<li>Parser generator vs handmade parser</li>
@@ -109,10 +106,11 @@ <h3 id="sending-header-and-body-together">sending header and body together</h3>
<div class="figure">
<img src="tcpdump.png" alt="Packet sequence of old Warp" /><p class="caption">Packet sequence of old Warp</p>
</div>
<p>To send them in a single TCP packet (when possible), new Warp switched from <code>writev()</code> to <code>send()</code>. It uses the <code>send()</code> system call with the <code>MSG_MORE</code> flag to store a header and the <code>sendfile()</code> system call to send both the stored header and a file. This made the throughput at least 100 times faster.</p>
<p>To send them in a single TCP packet (when possible), new Warp switched from <code>writev()</code> to <code>send()</code>. It uses <code>send()</code> with the <code>MSG_MORE</code> flag to store a header and <code>sendfile()</code> to send both the stored header and a file. This made the throughput at least 100 times faster.</p>
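<p>The following is a hedged sketch of this approach using the <code>simple-sendfile</code> package; the exact argument order of <code>sendfileWithHeader</code> is taken from the version we checked and may differ in other releases. The rendered response header is written with <code>send()</code> and <code>MSG_MORE</code>, and the body follows via <code>sendfile()</code>, so both can share one TCP packet when they fit.</p>
<pre><code>import Data.ByteString (ByteString)
import Network.Sendfile (FileRange (EntireFile), sendfileWithHeader)
import Network.Socket (Socket)

-- Send an HTTP response header and a file body back-to-back.  The
-- IO () argument is a hook invoked while sending (Warp uses it to
-- tickle the connection timeout); here it does nothing.
sendHeaderAndFile :: Socket -&gt; [ByteString] -&gt; FilePath -&gt; IO ()
sendHeaderAndFile sock header path =
    sendfileWithHeader sock path EntireFile (return ()) header</code></pre>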
<h2 id="clean-up-with-timers">Clean-up with timers</h2>
<p>This section explains how to implement connection timeouts and how to cache file descriptors.</p>
<h3 id="timers-for-connections">Timers for connections</h3>
<p>To prevent slowloris attacks, Warp kills the user thread communicating with a client if the client does not send a significant amount of data for a specified period (30 seconds by default).</p>
<p>To prevent slowloris attacks, a dedicated user thread kills the user thread communicating with a client if the client does not send a significant amount of data for a specified period (30 seconds by default).</p>
<p>TBD: System.Timeout</p>
<p>The heart of Warp's timeout system consists of the following two points:</p>
<ul>
@@ -132,17 +130,12 @@ <h3 id="timers-for-connections">Timers for connections</h3>
atomicModifyIORef ref (\ys -&gt; (merge xs&#39; ys, ()))</code></pre>
<p>The timeout manager atomically swaps the list with an empty list. Then it manipulates the list by turning statuses and/or removing unnecessary statuses for killed Haskell threads. During this process, new connections may be created, and their statuses are inserted with <code>atomicModifyIORef</code> by the corresponding Haskell threads. Then the timeout manager atomically merges the pruned list and the new list.</p>
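<p>The following is a minimal sketch of this scheme, not Warp's actual code; names such as <code>State</code>, <code>Entry</code> and <code>tickle</code> are ours. Each user thread calls <code>tickle</code> whenever it receives data, and the manager swaps, prunes and merges the shared list exactly as described above.</p>
<pre><code>import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Data.IORef

data State = Active | Inactive

-- Each user thread registers its status together with the action
-- that kills it (for example, throwing an exception to the thread).
type Entry   = (IORef State, IO ())
type Manager = IORef [Entry]

-- Called by a user thread whenever it receives data from its client.
tickle :: IORef State -&gt; IO ()
tickle ref = writeIORef ref Active

-- The manager periodically swaps the shared list with an empty one,
-- prunes it, and merges the survivors back with any entries that
-- were registered in the meantime.
manager :: Manager -&gt; Int -&gt; IO ()
manager mref period = forever $ do
    threadDelay period
    xs &lt;- atomicModifyIORef mref (\ys -&gt; ([], ys))
    pruned &lt;- prune xs
    atomicModifyIORef mref (\ys -&gt; (pruned ++ ys, ()))
  where
    prune [] = return []
    prune ((ref, kill) : rest) = do
        state &lt;- readIORef ref
        case state of
            Active -&gt; do
                writeIORef ref Inactive
                rest&#39; &lt;- prune rest
                return ((ref, kill) : rest&#39;)
            Inactive -&gt; do
                kill                  -- timed out: kill the user thread
                prune rest</code></pre>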
<h3 id="timers-for-file-descriptors">Timers for file descriptors</h3>
<p>Warp's timeout approach is safe to reuse as a cache mechanism for file descriptors because it does not use reference counters. However, we cannot simply reuse Warp's timeout code for several reasons:</p>
<p>Each Haskell thread has its own status, so status is not shared. But we would like to cache file descriptors, to avoid <code>open()</code> and <code>close()</code>, by sharing them. So we need to look up the file descriptor for a requested file among the cached ones. Since this look-up should be fast, we should not use a list. You may think <code>Data.Map</code> can be used. Yes, its look-up is O(log N), but there are two reasons why we cannot use it:</p>
<ol style="list-style-type: decimal">
<li><code>Data.Map</code> is a finite map which cannot contain multiple values for a single key.</li>
<li><code>Data.Map</code> does not provide a fast pruning method.</li>
</ol>
<p>Problem 1: because requests are received concurrently, two or more file descriptors for the same file may be opened. So, we need to store multiple file descriptors for a single file name. We can solve this by re-implementing <code>Data.Map</code> to hold a non-empty list. This is technically called a &quot;multimap&quot;.</p>
<p>Problem 2: <code>Data.Map</code> is based on a binary search tree called a &quot;weight-balanced tree&quot;. To the best of my knowledge, there is no way to prune the tree directly. You may also think that we can convert the tree to a list (<code>toList</code>), then prune it, and convert the list back to a new tree (<code>fromList</code>). The cost of the first two operations is O(N), but that of the last one is unfortunately O(N log N).</p>
<p>One day, I remembered Exercise 3.9 of &quot;Purely Functional Data Structures&quot; - to implement <code>fromOrdList</code>, which constructs a red-black tree from an ordered list in O(N). My friends and I have a study meeting on this book every month. To solve this problem, one member found a paper by Ralf Hinze, &quot;Constructing Red-Black Trees&quot;. If you want to know the concrete algorithm, please read this paper.</p>
<p>Since red-black trees are binary search trees, we can implement a multimap by combining them with non-empty lists. Fortunately, the list created with <code>toList</code> is sorted, so we can use <code>fromOrdList</code> to convert the sorted list to a new red-black tree. Now we have a multimap whose look-up is O(log N) and whose pruning is O(N).</p>
<p>The cache mechanism has already been merged into the master branch of Warp, and is awaiting release.</p>
<p>Let's consider the case where Warp sends the entire file by <code>sendfile()</code>. Unfortunately, we need to call <code>stat()</code> to know the size of the file because <code>sendfile()</code> on Linux requires the caller to specify how many bytes are to be sent (<code>sendfile()</code> on FreeBSD/MacOS has a magic number '0' which indicates the end of the file).</p>
<p>If WAI applications know the file size, Warp can avoid <code>stat()</code>. It is easy for WAI applications to cache file information such as size and modification time. If the cache timeout is short enough (say, 10 seconds), the risk of cache inconsistency is not serious. And because we can safely clean up the cache, we do not have to worry about leakage.</p>
<p>Since <code>sendfile()</code> requires a file descriptor, the naive sequence to send a file is <code>open()</code>, <code>sendfile()</code> repeatedly if necessary, and <code>close()</code>. In this section, we consider how to cache file descriptors to avoid <code>open()</code> and <code>close()</code>. Caching file descriptors should work as follows: if a client requests that a file be sent, a file descriptor is opened by <code>open()</code>. If another client requests the same file shortly thereafter, the previously opened file descriptor is reused. At a later time, the file descriptor is closed by <code>close()</code> if no user thread is using it.</p>
<p>A typical tactic for this case is a reference counter. But we were not sure that we could implement a robust reference-counting mechanism. What happens if a user thread is killed for unexpected reasons? If we fail to decrement its reference counter, the file descriptor leaks. We noticed that the connection-timeout scheme is safe to reuse as a cache mechanism for file descriptors because it does not use reference counters. However, we cannot simply reuse Warp's timeout code for several reasons:</p>
<p>Each user thread has its own status, so status is not shared. But we would like to cache file descriptors, to avoid <code>open()</code> and <code>close()</code>, by sharing them. So we need to look up the file descriptor for a requested file among the cached ones. Since this look-up should be fast, we should not use a list. Also, because requests are received concurrently, two or more file descriptors for the same file may be opened. So we need to store multiple file descriptors for a single file name. Such a structure is technically called a <em>multimap</em>.</p>
<p>We implemented a multimap with O(log N) look-up and O(N) pruning using red-black trees whose nodes contain non-empty lists. Since a red-black tree is a kind of binary search tree, look-up is O(log N), where N is the number of nodes. We can also translate it into an ordered list in O(N). In our implementation, pruning nodes which contain file descriptors to be closed is done during this conversion. An algorithm is known which converts an ordered list to a red-black tree in O(N).</p>
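<p>The following sketch shows only the shape of the structure, with hypothetical names rather than Warp's actual module: a red-black tree whose nodes carry a key and a non-empty list of values. Look-up walks the tree as in any binary search tree; pruning flattens the tree to an ordered list, drops the file descriptors to be closed, and rebuilds the tree from the ordered list in O(N).</p>
<pre><code>data Color = Red | Black

-- A multimap: a red-black tree whose nodes hold a key and a non-empty
-- list of values (here, the file descriptors opened for one file name).
data MMap k v
    = Leaf
    | Node Color (MMap k v) k [v] (MMap k v)   -- the [v] is never empty

-- Look-up is O(log N), as in any binary search tree.
lookupMM :: Ord k =&gt; k -&gt; MMap k v -&gt; [v]
lookupMM _ Leaf = []
lookupMM key (Node _ l k vs r)
    | key &lt; k   = lookupMM key l
    | key &gt; k   = lookupMM key r
    | otherwise = vs

-- Intended costs, following the text:
--   toSortedList   : O(N)  in-order traversal; pruning happens here
--   fromSortedList : O(N)  Hinze, &quot;Constructing Red-Black Trees&quot;</code></pre>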
<h2 id="future-work">Future work</h2>
<p>We have some items to improve Warp in the future, but we will explain only two here.</p>
<h3 id="memory-allocation">Memory allocation</h3>
@@ -154,6 +147,6 @@ <h3 id="memory-allocation">Memory allocation</h3>
<p>The brick-red bars indicate the events created by <code>traceEventIO</code>. The area surrounded by the two bars is the time consumed by <code>mallocByteString</code>. It is about 1/10 of an HTTP session. We are confident that the same thing happens when allocating receive buffers.</p>
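<p>Marking such a region is straightforward. The sketch below uses arbitrary label strings and assumes the program is compiled with <code>-eventlog</code> and run with <code>+RTS -l</code>, so that the events show up in an eventlog viewer such as ThreadScope.</p>
<pre><code>import Data.ByteString.Internal (mallocByteString)
import Data.Word (Word8)
import Debug.Trace (traceEventIO)
import Foreign.ForeignPtr (ForeignPtr)

main :: IO ()
main = do
    traceEventIO &quot;mallocByteString begin&quot;
    -- allocate a buffer via mallocByteString (the 4096-byte size is arbitrary)
    _ &lt;- mallocByteString 4096 :: IO (ForeignPtr Word8)
    traceEventIO &quot;mallocByteString end&quot;</code></pre>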
<h3 id="new-thundering-herd">New thundering herd</h3>
<p>Thundering herd is an old but new problem. Suppose that processes/native threads are pre-forked to share a listening socket. They call <code>accept()</code> on the socket. When a connection is created, old versions of Linux and FreeBSD wake up all of them. Only one can accept the connection and the others sleep again. Since this causes many context switches, we face a performance problem. This is called the <em>thundering herd</em>. Recent versions of Linux and FreeBSD wake up only one process/native thread, so this problem has become a thing of the past.</p>
<p>Recent network servers tend to use the <code>epoll</code>/<code>kqueue</code> family. If worker processes share a listening socket and they accept connections through the <code>epoll</code>/<code>kqueue</code> family, the thundering herd appears again. This is because the semantics of the <code>epoll</code>/<code>kqueue</code> family is to notify all processes/native threads. <code>nginx</code> and <code>mighty</code> are victims of this new thundering herd.</p>
<p>Recent network servers tend to use the <code>epoll</code> family. If worker processes share a listening socket and they accept connections through the <code>epoll</code> family, the thundering herd appears again. This is because the semantics of the <code>epoll</code> family is to notify all processes/native threads. <code>nginx</code> and <code>mighty</code> are victims of this new thundering herd.</p>
<p>The parallel IO manager is free from the new thundering herd problem. In this architecture, only one IO manager accepts new connections through the <code>epoll</code> family, and the other IO managers handle established connections.</p>
<h2 id="conclusion">Conclusion</h2>
