Getting Started

whiter4bbit edited this page · 31 revisions

Preliminaries

To get started with Scalding, first clone the Scalding repository on GitHub:

git clone git@github.com:twitter/scalding.git

Next, build the code using sbt (a standard Scala build tool). Make sure you have Scala (download here) and sbt 0.7.4 installed (see here for instructions; Mac OS X folks using Homebrew can also check out this page to get their Scala and sbt versions correct), and run the following commands:

sbt update
sbt test
sbt package-dist
sbt assembly # creates a fat jar with all dependencies, which is useful when using the scald.rb script

Now you're good to go!

WordCount in Scalding

Let's look at a simple WordCount job.

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line : String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

This job reads in a file, emits every word in a line, counts the occurrences of each word, and writes these word-count pairs to a tab-separated file.

To run the job, copy the code to WordCountJob.scala, and enter the following command from the root of the Scalding repository:

scripts/scald.rb --local WordCountJob.scala --input someInputfile.txt --output ./someOutputFile.tsv

This runs the WordCount job in local mode (i.e., not on a Hadoop cluster). After a few seconds, your first Scalding job should be done!

WordCount dissection

Let's take a closer look at the job.

TextLine

TextLine is an example of a Scalding source that reads each line of a file into a field named line.

TextLine(args("input")) // args("input") contains a filename to read from
  .read

Another common source is a Tsv source that reads tab-delimited files. You can also create sources that read directly from LZO-compressed files on HDFS (possibly containing Protobuf- or Thrift-encoded objects!), or even database sources that read directly from a MySQL table.

See Scalding Sources for more information.

flatMap

flatMap is an example of a function that you can apply to a stream of tuples.

TextLine(args("input"))
  .read
  // flat map the "line" field to a new "word" field
  .flatMap('line -> 'word) { line : String => line.split("\\s+") }

First, we specify the name of the field we want to flatMap over (line, in this case), as well as the name of the output field. We then pass in a function that describes how to flat map over these fields (here, we split each line into individual words).

Our tuple stream now contains something like the following:

this is a line    this
this is a line    is
this is a line    a
this is a line    line
...

See these code snippets for more examples of flatMap (including how to flat map from and to multiple fields), as well as examples of other functions you can apply to a tuple stream.
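To make the shape of the flatMap output concrete, here is a minimal sketch that uses only plain Scala collections, not Scalding itself. The object name FlatMapSketch and the method explode are hypothetical names chosen for illustration. It assumes that, as in the tuple stream shown above, the original line value is carried along next to each new word value.

```scala
object FlatMapSketch {
  // Plain-collections analogue of .flatMap('line -> 'word):
  // each input line yields one (line, word) pair per word, so the
  // original line is kept alongside every word split out of it.
  // This is an illustration only and does not use the Scalding API.
  def explode(lines: Seq[String]): Seq[(String, String)] =
    lines.flatMap { line =>
      line.split("\\s+").toSeq.map(word => (line, word))
    }
}
```

Calling FlatMapSketch.explode(Seq("this is a line")) returns four pairs, one for each of this, is, a, and line, each paired with the full input line.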

groupBy

Next, we group the same words together, and count the size of each group.

TextLine(args("input"))
  .read
  .flatMap('line -> 'word) { line : String => line.split("\\s+") }
  .groupBy('word) { _.size } // equivalent to .groupBy('word) { group => group.size }

Here, we group the tuple stream into groups of tuples with the same word, and then add a new field with the size of each group. By default, the new field is simply called size, but we can also specify the name via _.size('numWords).

The tuple stream now looks like:

hello    5
world    3
this     1
...

Again, see these code snippets for more examples of grouping functions.
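The grouping step can be pictured with plain Scala collections as well. The sketch below is not Scalding code; GroupBySizeSketch and countWords are hypothetical names, and the in-memory Map stands in for the grouped tuple stream.

```scala
object GroupBySizeSketch {
  // Plain-collections analogue of .groupBy('word) { _.size }:
  // gather equal words into groups, then replace each group
  // with the number of elements it contains.
  def countWords(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, group) => (word, group.size) }
}
```

For example, GroupBySizeSketch.countWords(Seq("hello", "world", "hello")) yields a map in which hello has count 2 and world has count 1.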

write, Tsv

Finally, just as we read from a TextLine source, we can also output our computations to a Tsv source.

TextLine(args("input"))
  .read
  .flatMap('line -> 'word) { line : String => line.split("\\s+") }
  .groupBy('word) { _.size }
  .write(Tsv(args("output")))
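As a rough picture of what ends up in the tab-separated output, the sketch below formats word counts as one tab-delimited row per word using plain Scala. TsvSketch and toTsvLines are hypothetical names, and sorting by word is an assumption made here only so that the result is deterministic; this page does not say in what order Scalding writes the rows.

```scala
object TsvSketch {
  // Format each (word, count) pair as "word<TAB>count".
  // Sorting by word is an illustrative assumption, not a
  // statement about how Scalding orders its output.
  def toTsvLines(counts: Map[String, Int]): Seq[String] =
    counts.toSeq.sortBy(_._1).map { case (word, n) => s"$word\t$n" }
}
```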

scald.rb

The scald.rb script in the scripts/ directory makes it easy to run jobs either in local mode or on a remote Hadoop cluster. It handles simple command-line parsing and copies over the necessary JAR files when running remote jobs.

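If you're running many Scalding jobs, it can be useful to add scald.rb to your PATH, so that you don't need to type the full path to the script every time. One way of doing this is to run something like the following from the root of the Scalding repository (the link target is given as an absolute path, because a relative target would be resolved against the directory that contains the link):

ln -s "$(pwd)/scripts/scald.rb" "$HOME/bin/"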

This creates a symlink to the scald.rb script in your $HOME/bin/ directory (which should already be included in your PATH).

See scald.rb for more information, including instructions on how to set up the script to run jobs remotely.

Next Steps

You now know the basics of Scalding! To learn more, check out the following resources:

  • tutorial/: this folder contains a tutorial series of runnable jobs.
  • Code Snippets: also under that folder is a list of code snippets explaining different kinds of Scalding functions (e.g., map, filter, project, groupBy, join).

Follow @Scalding on Twitter for updates.
