A Scala API for Cascading

Scalding

Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.

Current version: 0.9.1

Word Count

Hadoop is a distributed system for counting words. Here is how it's done in Scalding.

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}

Notice that the tokenize function, which is standard Scala, integrates naturally with the rest of the MapReduce job. This is a very powerful feature of Scalding. (Compare it to the use of UDFs in Pig.)
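To see what the job computes without a Hadoop cluster, here is a minimal plain-Scala sketch that applies the same tokenize function to an in-memory list of lines. The names LocalWordCount and countWords are hypothetical and not part of Scalding; the flatMap and groupBy calls below are ordinary Scala collection methods that only mirror the shape of the pipeline above.

```scala
object LocalWordCount {
  // Same tokenizer as in WordCountJob: lowercase, strip punctuation,
  // then split on runs of whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // In-memory analogue of flatMap('line -> 'word) followed by
  // groupBy('word) { _.size }: one entry per distinct word with its count.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("Hello, world!", "hello Scalding")
    // Print tab-separated word/count pairs, sorted by word.
    countWords(sample).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(word + "\t" + count)
    }
  }
}
```

Because tokenize is an ordinary method, the same definition can be exercised locally like this and then reused unchanged inside the job.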

You can find more example code under examples/. If you're interested in comparing Scalding to other languages, see our Rosetta Code page, which has several MapReduce tasks in Scalding and other frameworks (e.g., Pig and Hadoop Streaming).

Documentation and Getting Started

Please feel free to use the beautiful Scalding logo artwork anywhere.

Building

There is a script (called sbt) in the root that loads the correct sbt version to build:

  1. ./sbt update (takes 2 minutes or more)
  2. ./sbt test
  3. ./sbt assembly (needed to make the jar used by the scald.rb script)

The test suite takes a while to run. When you're in sbt, here's a shortcut to run just one test:

> test-only com.twitter.scalding.FileSourceTest

Please refer to the FAQ page if you encounter problems using sbt.

We use Travis CI to verify the build.

Scalding modules are available from Maven Central.

The current group ID and version for all modules are, respectively, "com.twitter" and 0.9.1.

The currently published artifacts are:

  • scalding-core_2.9.2
  • scalding-core_2.10
  • scalding-args_2.9.2
  • scalding-args_2.10
  • scalding-date_2.9.2
  • scalding-date_2.10
  • scalding-commons_2.9.2
  • scalding-commons_2.10
  • scalding-avro_2.9.2
  • scalding-avro_2.10

The suffix denotes the Scala version.
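As a rough sketch of how these artifacts might be pulled into an sbt project: the `%%` operator appends the Scala version suffix to the artifact name automatically, so the suffix does not need to be spelled out. The `scalaVersion` value below is an assumption for illustration, and the module version is the one given on the "Current version" line of this README; substitute whichever release you actually need.

```scala
// build.sbt (illustrative sketch, not taken from the Scalding repository)

// Assumed Scala version; with a 2.10.x compiler, %% resolves scalding-core_2.10.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Group ID "com.twitter" as stated above; %% adds the _2.10 suffix.
  "com.twitter" %% "scalding-core" % "0.9.1",
  // Optional extra module, listed among the published artifacts.
  "com.twitter" %% "scalding-commons" % "0.9.1"
)
```

Writing the suffix by hand with a single `%`, for example `"com.twitter" % "scalding-core_2.10" % "0.9.1"`, should resolve the same artifact.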

Adopters

  • eBay
  • Etsy
  • Sharethrough
  • Snowplow Analytics
  • SoundCloud
  • Twitter

To see a full list of users or to add yourself, see the wiki.

Contact

For user questions and discussion, we use the cascading-user mailing list: http://groups.google.com/group/cascading-user

For scalding development (internals, extending, release planning): https://groups.google.com/forum/#!forum/scalding-dev

In the remote possibility that there exist bugs in this code, please report them to: https://github.com/twitter/scalding/issues

Follow @Scalding on Twitter for updates.

Chat (IRC): #scalding on freenode

Authors:

Thanks for assistance and contributions:

A full list of contributors can be found on GitHub.

License

Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
