This repository has been archived by the owner on Jun 9, 2021. It is now read-only.

Word Count

tdunning edited this page Sep 14, 2010 · 1 revision

Here is a quick description of how to code up the perennial map-reduce demo program for counting words. We start with lines of text, tokenize them into words, and then count how often each word occurs. The example can be found in the class WordCountTest in Plume.

So we start with a PCollection<String> called lines as input. For each line, we split the line
into words and emit each word as a separate record:

   
  PCollection<String> words = lines
    .map(new DoFn<String, String>() {
      @Override
      public void process(String x, EmitFn<String> emitter) {
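        // onNonWordChar is defined elsewhere in WordCountTest (not shown here);
        // judging by its name, it splits a line on non-word characters
        for (String word : onNonWordChar.split(x)) {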
          emitter.emit(word);
        }
      }
    }, collectionOf(strings()));
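
For readers who want to try the tokenization step without Plume, here is a minimal plain-Java sketch. The class name TokenizeSketch, the helper tokenize, and the regular expression `\W+` are assumptions made for illustration; the splitter actually used in WordCountTest is not shown on this page and may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the tokenizing DoFn; not part of Plume.
class TokenizeSketch {
    // Assumption: words are separated by runs of non-word characters.
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : NON_WORD.split(line)) {
            // split() yields an empty first element when the line
            // starts with a separator, so skip empty tokens.
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, world, Hello, again]
        System.out.println(tokenize("Hello, world! Hello again."));
    }
}
```

Each element of the returned list corresponds to one call to emitter.emit in the Plume version.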

Then we emit each word as a key for a PTable with a count of 1. This is just the same as most word-count implementations except that we have separated the tokenization from the emission of the original counts. We could have put them together into a single map operation, but the optimizer will do that for us (when it exists) so keeping the functions modular is probably better.

  PTable<String, Integer> wc = words
    .map(new DoFn<String, Pair<String, Integer>>() {
      @Override
      public void process(String x,
                          EmitFn<Pair<String, Integer>> emitter) {
        emitter.emit(Pair.create(x, 1));
      }
    }, tableOf(strings(), integers()))

Then we group by word

  .groupByKey()
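
To make the effect of grouping concrete, here is a small plain-Java sketch of what the data looks like after this step. It does not use Plume; the class GroupByKeySketch and its groupByKey helper are hypothetical names, and a sorted in-memory map stands in for the grouped table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the grouped shape; not part of Plume.
class GroupByKeySketch {
    // Emits a count of 1 per word and collects the 1s under each word.
    static Map<String, List<Integer>> groupByKey(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // prints {a=[1, 1], b=[1]}
        System.out.println(groupByKey(Arrays.asList("a", "b", "a")));
    }
}
```

Each word now maps to the collection of counts emitted for it, which is the input the counting step receives.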

And do the counting. Note how we don’t have to worry about the details of using a combiner or a reducer.

  .combine(new CombinerFn<Integer>() {
     @Override
     public Integer combine(Iterable<Integer> counts) {
       int sum = 0;
       for (Integer k : counts) {
         sum += k;
       }
       return sum;
     }
   });
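
One reason the combiner-versus-reducer detail can be left to the framework is that summation gives the same answer whether it is applied once to all counts or first to partial groups and then to the partial sums. The plain-Java sketch below illustrates that property; the class CombineSketch and the way the counts are partitioned are assumptions for illustration, not a description of how Plume schedules the work.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a sum works as combiner and reducer.
class CombineSketch {
    // Same logic as the body of the CombinerFn above.
    static int sum(Iterable<Integer> counts) {
        int total = 0;
        for (int k : counts) {
            total += k;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 1, 1, 1, 1);

        // Reduce everything in one pass.
        int direct = sum(counts);

        // Combine two partial groups first, then combine the partial sums.
        int partialA = sum(counts.subList(0, 2));
        int partialB = sum(counts.subList(2, 5));
        int combined = sum(Arrays.asList(partialA, partialB));

        // prints 5 5
        System.out.println(direct + " " + combined);
    }
}
```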

In all, it takes 27 lines to implement a slightly more general word-count than the one in the Hadoop tutorial. If we were to compare apples to apples, this code would probably be a few lines shorter. The original word-count demo took 210 lines to do the same thing.