Streaming MapReduce with Scalding and Storm

Summingbird

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

While a word-counting aggregation in pure Scala might look like this:

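  // toWords is assumed to split a sentence into words;
  // MutableMap is scala.collection.mutable.Map.
  def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }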

Counting words in Summingbird looks like this:

  def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { sentence =>
        toWords(sentence).map(_ -> 1L)
      }.sumByKey(store)

The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in "batch mode" (using Scalding), in "realtime mode" (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
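
To make the comparison concrete, here is a minimal, self-contained sketch of the pure-Scala side that you can paste into a file and run. It is not Summingbird code: `toWords` is a hypothetical whitespace tokenizer (the snippets above leave it undefined), and `sumByKey` here is only a plain-collections stand-in that adds each value into a mutable map, which is roughly what `sumByKey(store)` does for `Long` counts.

```scala
import scala.collection.mutable.{Map => MutableMap}

object WordCountSketch {
  // Hypothetical tokenizer: lowercases and splits on whitespace.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Plain-collections stand-in for sumByKey: add each value to the
  // running total stored under its key, starting from 0 if absent.
  def sumByKey(pairs: Iterable[(String, Long)], store: MutableMap[String, Long]): Unit =
    pairs.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }

  def wordCount(source: Iterable[String], store: MutableMap[String, Long]): Unit =
    sumByKey(source.flatMap(sentence => toWords(sentence).map(_ -> 1L)), store)

  def main(args: Array[String]): Unit = {
    val store = MutableMap.empty[String, Long]
    wordCount(Seq("i like storm", "I like scalding"), store)
    // prints: (i,2), (like,2), (scalding,1), (storm,1)
    println(store.toSeq.sortBy(_._1).mkString(", "))
  }
}
```

The Summingbird version keeps the same `flatMap` step and replaces the hand-written update loop with `sumByKey(store)`, leaving the platform to decide how the sums are computed and stored.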

Summingbird provides you with the primitives you need to build rock solid production systems.

Getting Started: Word Count with Twitter

The summingbird-example project allows you to run the wordcount program above on a sample of Twitter data using a local Storm topology and memcache instance. You can find the actual job definition in ExampleJob.scala.

First, make sure you have memcached installed locally. If you don't and you're on OS X, you can get it by installing Homebrew and running this command in a shell:

brew install memcached

When this is finished, run the memcached command in a separate terminal.

Now you'll need to set up access to the Twitter Streaming API. This blog post has a great walkthrough, so open that page, head over to https://dev.twitter.com/ and get your various keys and tokens. Once you have these, clone the Summingbird repository:

git clone https://github.com/twitter/summingbird.git
cd summingbird

Then open StormRunner.scala in your editor and replace the dummy values in the config variable with your auth tokens:

lazy val config = new ConfigurationBuilder()
    .setOAuthConsumerKey("mykey")
    .setOAuthConsumerSecret("mysecret")
    .setOAuthAccessToken("token")
    .setOAuthAccessTokenSecret("tokensecret")
    .setJSONStoreEnabled(true) // required for JSON serialization
    .build

You're all ready to go! Now it's time to unleash Storm on your Twitter stream. Make sure the memcached terminal is still open, then start Storm from the summingbird directory:

./sbt "summingbird-example/run --local"

Storm should print a burst of output, then stabilize and appear to hang. This means that Storm is updating your local memcached instance with counts of every word that it sees in each tweet.

To query the aggregate results in memcached, you'll need to open an sbt REPL in a new terminal:

./sbt summingbird-example/console

At the REPL prompt, run the following:

scala> import com.twitter.summingbird.example._
import com.twitter.summingbird.example._

scala> StormRunner.lookup("i")
<memcache store loading elided>
res0: Option[Long] = Some(5)

scala> StormRunner.lookup("i")
res1: Option[Long] = Some(52)

Boom. Counts for the word "i" are growing in realtime.
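
If you would rather watch the count grow without retyping the lookup, a small polling helper is one option. The sketch below is not part of the example project: `LookupPoller` and `poll` are hypothetical names, and the helper takes any `String => Option[Long]` function, so it assumes only that `StormRunner.lookup` has the shape the REPL session above suggests.

```scala
object LookupPoller {
  // Calls `lookup(word)` `times` times, printing each result and sleeping
  // `delayMs` milliseconds between calls. Returns the observed values in order.
  def poll(
    lookup: String => Option[Long],
    word: String,
    times: Int,
    delayMs: Long
  ): Seq[Option[Long]] =
    (1 to times).map { i =>
      val seen = lookup(word)
      println(s"$word -> $seen")
      if (i < times) Thread.sleep(delayMs) // no sleep after the last call
      seen
    }
}
```

From the same REPL you could then try something like `LookupPoller.poll(StormRunner.lookup(_), "i", 5, 1000L)` to sample the count once per second.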

See the wiki page for a more detailed explanation of the configuration required to get this job up and running and some ideas for where to go next.

Documentation

To learn more and find links to tutorials and information around the web, check out the Summingbird Wiki.

The latest ScalaDocs are hosted on Summingbird's GitHub Project Page.

Contact

Discussion occurs primarily on the Summingbird mailing list. Issues should be reported on the GitHub issue tracker. Simpler issues appropriate for first-time contributors looking to help out are tagged "newbie".

IRC: freenode channel #summingbird

Follow @summingbird on Twitter for updates.

Please feel free to use the beautiful Summingbird logo artwork anywhere.

Get Involved + Code of Conduct

Pull requests and bug reports are always welcome!

We use a lightweight form of project governance inspired by the one used by Apache projects. Please see Contributing and Committership for our code of conduct and our pull request review process. The TL;DR: send us a pull request, iterate on the feedback and discussion, and get a +1 from a Committer to have your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: Committers

A list of contributors to the project can be found here: Contributors

Maven

Summingbird modules are published on Maven Central. The current groupId and version for all modules are, respectively, "com.twitter" and 0.9.1.

Currently published artifacts are:

  • summingbird-core_2.11
  • summingbird-core_2.10
  • summingbird-batch_2.11
  • summingbird-batch_2.10
  • summingbird-client_2.11
  • summingbird-client_2.10
  • summingbird-storm_2.11
  • summingbird-storm_2.10
  • summingbird-scalding_2.11
  • summingbird-scalding_2.10
  • summingbird-builder_2.11
  • summingbird-builder_2.10

The suffix denotes the Scala version.
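
As a sketch of how these coordinates might be used from sbt (the choice of modules below is only an illustration; pick the ones your job needs):

```scala
// build.sbt (illustrative fragment)
// %% appends the project's Scala binary version (_2.10 or _2.11) to the
// artifact name, so scalaVersion must be set to one of the published versions.
libraryDependencies ++= Seq(
  "com.twitter" %% "summingbird-core"  % "0.9.1",
  "com.twitter" %% "summingbird-storm" % "0.9.1"
)
```

With `%%`, sbt resolves `summingbird-core_2.11` when the build uses Scala 2.11 and `summingbird-core_2.10` when it uses Scala 2.10.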

Authors (alphabetically)

License

Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0