Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Newer
Older
100644 106 lines (81 sloc) 4.784 kb
790a1aa @johnynek Initial Import
johnynek authored
1 # Scalding
2 Scalding is a library that has two components:
3
4 * a scala DSL to make map-reduce computations look very similar to scala's collection API
5 * a wrapper to Cascading to make simpler to define the usual use cases of jobs, tests and describing new data on HDFS.
6
7 To run scala scalding jobs, a script, scald.rb is provided in scripts/. Run this script
8 with no arguments to see usage tips. You will need to customize the default variables
9 at the head of that script for your environment.
10
11 You should follow the scalding project on twitter: <http://twitter.com/scalding>
12
13 ## Word Count
14 Hadoop is a distributed system for counting words. Here is how it's done in scalding. You can find this in examples:
15
6949568 @azymnis Fix indentation in example.
azymnis authored
16 ```scala
be05809 @azymnis Some more markdown fixups.
azymnis authored
17 package com.twitter.scalding.examples
6949568 @azymnis Fix indentation in example.
azymnis authored
18
be05809 @azymnis Some more markdown fixups.
azymnis authored
19 import com.twitter.scalding._
6949568 @azymnis Fix indentation in example.
azymnis authored
20
be05809 @azymnis Some more markdown fixups.
azymnis authored
21 class WordCountJob(args : Args) extends Job(args) {
22 TextLine( args("input") ).read.
23 flatMap('line -> 'word) { line : String => line.split("\\s+") }.
24 groupBy('word) { _.size }.
25 write( Tsv( args("output") ) )
26 }
6949568 @azymnis Fix indentation in example.
azymnis authored
27 ```
790a1aa @johnynek Initial Import
johnynek authored
28
29 ##Tutorial
30 See tutorial/ for examples of how to use the DSL. See tutorial/CodeSnippets.md for some
c0817de @johnynek Adds Travis support, updates scald.rb for version 0.3.1
johnynek authored
31 example scalding snippets. Edwin Chen wrote an excellent tutorial on using scalding for
32 recommendations:
33 <http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/>
790a1aa @johnynek Initial Import
johnynek authored
34
35 ## Building
09514e1 @johnynek Updates the readme to be correct for sbt 0.11
johnynek authored
36 0. Install sbt 0.11
26f00d8 @azymnis Fix up markup of README file.
azymnis authored
37 1. ```sbt update``` (takes 2 minutes or more)
38 2. ```sbt test```
09514e1 @johnynek Updates the readme to be correct for sbt 0.11
johnynek authored
39 3. ```sbt assembly``` (needed to make the jar used by the scald.rb script)
790a1aa @johnynek Initial Import
johnynek authored
40
c0817de @johnynek Adds Travis support, updates scald.rb for version 0.3.1
johnynek authored
41 We use Travis-ci.org to verify the build:
2a13558 @johnynek Fix Travis link in README.md
johnynek authored
42 [![Build Status](https://secure.travis-ci.org/twitter/scalding.png)](http://travis-ci.org/twitter/scalding)
c0817de @johnynek Adds Travis support, updates scald.rb for version 0.3.1
johnynek authored
43
b9a5950 @johnynek Mention in README difference to scoobi/scrunch
johnynek authored
44 ## Comparison to Scrunch/Scoobi
45 Scalding comes with an executable tutorial set that does not require a Hadoop
46 cluster. If you're curious about scalding, why not invest a bit of time and run the tutorial
47 yourself and make your own judgement.
48
49 Scalding was developed before either of those projects
50 were announced publicly and has been used in production at Twitter for more than six months
51 (though it has been through a few iterations internally).
52 The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has
53 a record model where each element in your distributed list/table is a table with some named
54 fields. This is nice because most common cases are to have a few primitive columns (ints, strings,
55 etc...). This is discussed in detail in the two answers to the following question:
56 <http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading>
57
58 Scoobi and Scrunch stress types and do not
8202733 @johnynek Minor fixup of README.md
johnynek authored
59 use field names to build ad-hoc record types. Cascading's fields are very convenient,
60 and our users have been very productive with Scalding. Fields do present problems for
b9a5950 @johnynek Mention in README difference to scoobi/scrunch
johnynek authored
61 type inference because Cascading cannot tell you the type of the data in Fields("user_id", "clicks")
62 at compile time. This could be surmounted by building a record system in scala that
63 allows the programmer to express the types of the fields, but the cost of this is not trivial,
64 and the win is not so clear.
65
8202733 @johnynek Minor fixup of README.md
johnynek authored
66 Scalding supports using any scala object in your map/reduce operations using Kryo serialization,
67 including scala Lists, Sets,
dde7293 @espringe Update scoobi comparison in README
espringe authored
68 Maps, Tuples, etc. It is not clear that such transparent serialization is present yet in
8202733 @johnynek Minor fixup of README.md
johnynek authored
69 scrunch. Like Scoobi, Scalding has a form of MSCR fusion by relying on Cascading's AggregateBy
dde7293 @espringe Update scoobi comparison in README
espringe authored
70 operations. Our Reduce primitives (see GroupBuilder.reduce and .mapReduceMap) are comparable to
71 Scoobi's combine primitive, which by default uses Hadoop combiners on the map side.
b9a5950 @johnynek Mention in README difference to scoobi/scrunch
johnynek authored
72
73 Lastly, Scalding comes with a script that allows you to write a single file and run that
74 single file locally or on your Hadoop cluster by typing one line "scald.rb [--local] myJob.scala".
75 It is really convenient to use the same language/tool to run jobs on Hadoop and then to post-process
76 the output locally.
77
790a1aa @johnynek Initial Import
johnynek authored
78 ## Mailing list
79
80 Currently we are using the cascading-user mailing list for discussions.
81 <http://groups.google.com/group/cascading-user>
82
83 Follow the scalding project on twitter for updates: <http://twitter.com/scalding>
84
85 ## Bugs
86 In the remote possibility that there exist bugs in this code, please report them to:
87 <https://github.com/twitter/scalding/issues>
88
89 ## Authors:
90 * Avi Bryant <http://twitter.com/avibryant>
91 * Oscar Boykin <http://twitter.com/posco>
92 * Argyris Zymnis <http://twitter.com/argyris>
93
94 Thanks for assistance and contributions:
95
96 * Chris Wensel <http://twitter.com/cwensel>
97 * Ning Liang <http://twitter.com/ningliang>
98 * Dmitriy Ryaboy <http://twitter.com/squarecog>
99 * Dong Wang <http://twitter.com/dongwang218>
100 * Edwin Chen <http://twitter.com/edchedch>
101
102 ## License
103 Copyright 2012 Twitter, Inc.
104
b9a5950 @johnynek Mention in README difference to scoobi/scrunch
johnynek authored
105 Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
Something went wrong with that request. Please try again.