Permalink
Switch branches/tags
Nothing to show
Find file
Fetching contributors…
Cannot retrieve contributors at this time
161 lines (141 sloc) 10.4 KB
---
timestamp: Fri 13 Nov 2009 08:29:30 PM PST
title: in which things are mapped, but also reduced
tags: "clojure"
id: 130
content: |-
<p>A while ago Tim Bray had a project called
the <a href="http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder">Wide
Finder</a> to collect implementations of a log parsing problem in
different languages to see which ones made it easy to take
advantage of massively-parallel hardware. The idea is to accept a
web server log file and return statistics for which pages have
been requested the
most. <a href="http://www.tbray.org/ongoing/When/200x/2009/11/11/Clojure-References">Yesterday</a>
he posted a follow-up in Clojure using refs.</p>
<p>While his version is interesting for someone just getting into
the language because it uses refs (probably the shiniest piece of
the language), I think a map/reduce approach is a little more
idiomatic since it can be done with no explicit state change.</p>
<p>I'll step
through <a href="http://p.hagelb.org/wide_finder.clj.html">my
implementation</a> piece-by-piece. Note that it is naïve and
could be optimized for speed at the expense of straightfowardness,
especially with regard to reading from disk;
my intent here is simply to explain the functional approach.</p>
<p><b>Update</b>: Commenters have posted <a
href="http://pastie.org/pastes/717940">a version that totally
smokes my agent-based implementation</a> in terms of performance
and simplicity. I'm leaving mine up because I think it's a fun romp
in the land of higher-order functions; if you think having
higher-order functions means "I can pass a block to
<code>Array#map</code> in Ruby" then you're in for a treat. If you
are looking for a hard-core walkthrough of high-performance
coding, your best bet is <a
href="http://meshy.org/2009/12/13/widefinder-2-with-clojure.html">Alex
Osborne's blog post</a> on Wide Finder.</p>
<pre class="code">
<span class="esk-paren">(</span><span class="keyword">ns</span> wide.finder
<span class="string">"A basic map/reduce approach to the wide finder using agents.
Optimized for being idiomatic and readable rather than speed."</span>
<span class="esk-paren">(</span><span class="builtin">:use</span> [clojure.java.io <span class="builtin">:only</span> [reader]]<span class="esk-paren">))</span>
</pre>
<p>Every Clojure file opens with a call to the <code>ns</code>
macro. This defines a namespace (<code>wide.finder</code>) and
states any dependencies it may have on other namespaces, in this
case the <code>reader</code> function from
the <code>clojure.java.io</code> namespace. Omitting
the <code>:only</code> clause would cause it to make all
the <code>io</code> vars available in this namespace,
but it's usually better to be explicit about what you need.</p>
<p>We'll start with the entry point. The <code>find-widely</code>
function below takes a filename and a number of agents to work
with. Agents can be thought of as independent asynchronous workers
that share a thread pool and keep a single state value. They are
initialized with the <code>agent</code> function that takes a
starting state. We'll be using each agent to keep a map of pages
to hit counts, so we map this function over a list
of <code>n</code> empty maps generated by <code>repeat</code> to
get our list of initialized agents.</p>
<pre class="code"><span class="esk-paren">(</span><span class="keyword">defn</span> <span class="function-name">find-widely</span> [filename n]
<span class="esk-paren">(</span><span class="keyword">let</span> [agents <span class="esk-paren">(</span><span class="builtin">map</span> agent <span class="esk-paren">(</span><span class="builtin">repeat</span> n {}<span class="esk-paren">))</span>]
<span class="esk-paren">(</span><span class="keyword">dorun</span> <span class="esk-paren">(</span><span class="builtin">map</span> #<span class="esk-paren">(</span><span class="builtin">send</span> %1 count-line %2<span class="esk-paren">)</span>
<span class="esk-paren">(</span><span class="builtin">cycle</span> agents<span class="esk-paren">)</span>
<span class="esk-paren">(</span><span class="builtin">line-seq</span> <span class="esk-paren">(</span>reader filename<span class="esk-paren">))))</span>
<span class="esk-paren">(</span><span class="keyword">apply</span> await agents<span class="esk-paren">)</span>
<span class="esk-paren">(</span><span class="builtin">apply</span> merge-with + <span class="esk-paren">(</span><span class="builtin">map</span> deref agents<span class="esk-paren">))))</span>
</pre>
<p>Once the agents are initialized, we send them
work. Our <code>line-seq</code> sets up a lazy sequence of lines
from the file. We map over this sequence together with an infinite
loop of the <code>agents</code> we construct
using <code>cycle</code>. The <code>map</code> function can loop
over multiple sequences in parallel and will stop when the
shorter one runs out, so this is just a way of pairing each line
in the file with an agent in a round-robin fashion.</p>
<pre class="code"><span class="esk-paren">(</span><span class="keyword">dorun</span> <span class="esk-paren">(</span><span class="builtin">map</span> #<span class="esk-paren">(</span><span class="builtin">send</span> %1 count-line %2<span class="esk-paren">)</span>
<span class="esk-paren">(</span><span class="builtin">cycle</span> agents<span class="esk-paren">)</span>
<span class="esk-paren">(</span><span class="builtin">line-seq</span> <span class="esk-paren">(</span>reader filename<span class="esk-paren">))))</span></pre>
<p>The code that does the mapping is the anonymous
function <code>#(send %1 count-line %2)</code>, which adds a call
to the <code>count-line</code> function to the agent's internal
queue. This could be written in more traditional lambda form
as <code>(fn [a line] (send a count-line line)</code>;
the <code>#()</code> form is simply
shorthand. The <code>count-line</code> function will be called
with two arguments: the agent's current value and the next line in
the seq. That function will return a new value for the agent.</p>
<pre class="code"><span class="esk-paren">(</span><span class="keyword">doseq</span> [a agents] <span class="esk-paren">(</span>await a<span class="esk-paren">))</span></pre>
<p>Once all the work has been queued up, it's necessary to
call <code>await</code> on each agent to make sure it has a chance
to finish its queue. If we used <code>map</code> here, it would
be a no-op, since <code>map</code> merely creates a lazy sequence;
it does not actually perform the function calls until the value is
needed. Since it's not in the tail-call position, the value would
simply be discarded. Note that the same problem
would occur in the previous call to map, but wrapping it in
a <code>dorun</code> call forces the lazy seq to be evaluated.</p>
<pre class="code"><span class="esk-paren">(</span><span class="builtin">apply</span> merge-with + <span class="esk-paren">(</span><span class="builtin">map</span> deref agents<span class="esk-paren">))</span></pre>
<p>Once that's done we can call <code>deref</code> on each agent to
get its value. But this gives us <code>n</code> maps rather than a
list of totals, so we need to merge the values.
<code>merge-with</code> is a special case of <code>reduce</code>
that assumes it works on a sequence of maps and merges key
collisions using the provided function, in this
case <code>+</code>. This gives us a map of page names to hit
counts.</p>
<pre class="code"><span class="esk-paren">(</span><span class="keyword">defn</span> <span class="function-name">count-line</span> [counts line]
<span class="esk-paren">(</span><span class="keyword">if-let</span> [[_ hit] <span class="esk-paren">(</span><span class="builtin">re-find</span> #<span class="string">"GET /(\d+) "</span> line<span class="esk-paren">)</span>]
<span class="esk-paren">(</span>update-in counts [hit] inc-or-init<span class="esk-paren">)</span>
counts<span class="esk-paren">))</span></pre>
<p>Finally we have the function that actually performs the
counting. As mentioned, it takes an agent's state (which is the current
map of pages to hits) and a line from the log file. If the line
matches the regex <code>#"GET /(\d+) "</code>, then we return an
updated version of the <code>counts</code> map that increments the
entry corresponding to the hit. The other interesting thing here
is that <code>if-let</code> uses <em>destructuring</em>: by
binding the return value of <code>re-find</code> to a vector form,
it splits the value in two. The first element (the full matched
string) is bound to the unused <code>_</code> local, and the
second element (the match group corresponding to the actual page
path) is bound to <code>hit</code>.</p>
<pre class="code"><span class="esk-paren">(</span><span class="keyword">defn</span> <span class="function-name">inc-or-init</span> [i]
<span class="esk-paren">(</span><span class="keyword">if</span> i <span class="esk-paren">(</span><span class="builtin">inc</span> i<span class="esk-paren">)</span> 1<span class="esk-paren">))</span>
</pre>
<p>The only piece left is the tiny <code>inc-or-init</code> function
that increments the counter given it but treats nil as zero. This
would be unnecessary if we could construct hash-maps with custom default
values, which a perusal of the implementation of PersistentHashMap.java
seems to indicate is supported, though it's not exposed anywhere
through a Clojure function. This is the piece of the puzzle that
I'm the least happy with, but it may be possible to eliminate it.</p>
<p>In any case, this is a simple example of how to break up a
commonplace problem using a classic map/reduce strategy and
immutable data structures. I have no idea how it compares in terms
of performance to the other Wide Finder implementations since the
logs I have for this site are not exactly the hundred-megabyte
blockbuster kind of logs
that <a href="http://tbray.org/ongoing">ongoing</a> enjoys, but
the fact that it can be parallelized in twelve lines is a
testament to the expressiveness of the language.</p>