Streaming MapReduce with Scalding and Storm
Clone or download
ttim Externalize `Monoid` in flat map and summer bolts (#766)
Currently `Monoid` instance passed to `sumByKey` is serialized through java serialization (for summingbird-storm platform), while other user's code (e.g. flatMap functions) is serialized through `Externalizer` (which tries java serialization and kryo in case it didn't work). Seems like this behaviour should be consistent and we should serialize `Monoid` through `Externalizer` as well.

In this PR I've added test for this and fixed it via putting `Monoid` into `Externalizer` box.
Latest commit d5322b3 Jul 3, 2018
Failed to load latest commit information.
logo Add official summingbird logo May 20, 2014
project Fix logging in storm tests (#723) Jun 1, 2017
summingbird-batch-hadoop/src Remove unused imports, use latest scala (#751) Jan 22, 2018
summingbird-batch/src Support Scala 2.12 (#724) Jul 25, 2017
summingbird-builder/src Support Scala 2.12 (#724) Jul 25, 2017
summingbird-chill/src Remove unused imports (#718) Mar 8, 2017
summingbird-client/src Remove unused imports, use latest scala (#751) Jan 22, 2018
summingbird-core-test/src Improve Producer equals methods Feb 13, 2018
summingbird-core/src Use dagon for dag rewrites (#758) Feb 15, 2018
summingbird-example/src Support Scala 2.12 (#724) Jul 25, 2017
summingbird-online/src Externalize `Monoid` in flat map and summer bolts (#766) Jul 3, 2018
summingbird-scalding-test/src Support Scala 2.12 (#724) Jul 25, 2017
summingbird-scalding/src/main/scala/com/twitter/summingbird/scalding clean up some closures Feb 28, 2018
summingbird-storm-test/src Externalize `Monoid` in flat map and summer bolts (#766) Jul 3, 2018
summingbird-storm/src Externalize `Monoid` in flat map and summer bolts (#766) Jul 3, 2018
.gitignore Update the build to .sbt style Feb 2, 2016
.jvmopts Shrink jvm memory props (#726) Jun 5, 2017
.travis.yml Remove unused imports, use latest scala (#751) Jan 22, 2018 Updates for the 0.9.1 release Nov 16, 2015 Remove rubanm from committers list (#748) Sep 14, 2017 link. Aug 27, 2013
LICENSE Project template. Oct 23, 2012
NOTICE Project template. Oct 23, 2012 update build to prep for 2.12 (#700) Dec 12, 2016
build.sbt Use dagon for dag rewrites (#758) Feb 15, 2018
sbt update build to prep for 2.12 (#700) Dec 12, 2016
version.sbt Issue #675.Configurable elimination of FlatMapNode by enhancing Sourc… ( Aug 3, 2016


Build Status Codecov branch Latest version Chat

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Summingbird Logo

While a word-counting aggregation in pure Scala might look like this:

  def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.foreach { case (k, v) => store.update(k, store.get(k) + v) }

Counting words in Summingbird looks like this:

  def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { sentence =>
        toWords(sentence).map(_ -> 1L)

The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in "batch mode" (using Scalding), in "realtime mode" (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.

Summingbird provides you with the primitives you need to build rock solid production systems.

Getting Started: Word Count with Twitter

The summingbird-example project allows you to run the wordcount program above on a sample of Twitter data using a local Storm topology and memcache instance. You can find the actual job definition in ExampleJob.scala.

First, make sure you have memcached installed locally. If not, if you're on OS X, you can get it by installing Homebrew and running this command in a shell:

brew install memcached

When this is finished, run the memcached command in a separate terminal.

Now you'll need to set up access to the Twitter Streaming API. This blog post has a great walkthrough, so open that page, head over to and get your various keys and tokens. Once you have these, clone the Summingbird repository:

git clone
cd summingbird

And open StormRunner.scala in your editor. Replace the dummy variables under config variable with your auth tokens:

lazy val config = new ConfigurationBuilder()
    .setJSONStoreEnabled(true) // required for JSON serialization

You're all ready to go! Now it's time to unleash Storm on your Twitter stream. Make sure the memcached terminal is still open, then start Storm from the summingbird directory:

./sbt "summingbird-example/run --local"

Storm should puke out a bunch of output, then stabilize and hang. This means that Storm is updating your local memcache instance with counts of every word that it sees in each tweet.

To query the aggregate results in Memcached, you'll need to open an SBT repl in a new terminal:

./sbt summingbird-example/console

At the launched repl, run the following:

scala> import com.twitter.summingbird.example._
import com.twitter.summingbird.example._

scala> StormRunner.lookup("i")
<memcache store loading elided>
res0: Option[Long] = Some(5)

scala> StormRunner.lookup("i")
res1: Option[Long] = Some(52)

Boom. Counts for the word "i" are growing in realtime.

See the wiki page for a more detailed explanation of the configuration required to get this job up and running and some ideas for where to go next.


To learn more and find links to tutorials and information around the web, check out the Summingbird Wiki.

The latest ScalaDocs are hosted on Summingbird's Github Project Page.


Discussion occurs primarily on the Summingbird mailing list. Issues should be reported on the GitHub issue tracker. Simpler issues appropriate for first-time contributors looking to help out are tagged "newbie".

IRC: freenode channel #summingbird

Follow @summingbird on Twitter for updates.

Please feel free to use the beautiful Summingbird logo artwork anywhere.

Get Involved + Code of Conduct

Pull requests and bug reports are always welcome!

We use a lightweight form of project governence inspired by the one used by Apache projects. Please see Contributing and Committership for our code of conduct and our pull request review process. The TL;DR is send us a pull request, iterate on the feedback + discussion, and get a +1 from a Committer in order to get your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: Committers

A list of contributors to the project can be found here: Contributors


Summingbird modules are published on maven central. The current groupid and version for all modules is, respectively, "com.twitter" and 0.9.1.

Current published artifacts are

  • summingbird-core_2.11
  • summingbird-core_2.10
  • summingbird-batch_2.11
  • summingbird-batch_2.10
  • summingbird-client_2.11
  • summingbird-client_2.10
  • summingbird-storm_2.11
  • summingbird-storm_2.10
  • summingbird-scalding_2.11
  • summingbird-scalding_2.10
  • summingbird-builder_2.11
  • summingbird-builder_2.10

The suffix denotes the scala version.

Authors (alphabetically)


Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: