# Scalding
Scalding is a library that has two components:

* a Scala DSL to make map-reduce computations look very similar to Scala's collection API
* a wrapper around Cascading that makes it simpler to define the usual use cases: jobs, tests, and describing new data on HDFS.

To run Scala scalding jobs, a script, scald.rb, is provided in scripts/. Run this script
with no arguments to see usage tips. You will need to customize the default variables
at the head of that script for your environment.

You should follow the scalding project on twitter: <http://twitter.com/scalding>

## Word Count
Hadoop is a distributed system for counting words. Here is how it's done in scalding. You can find this in examples:
```scala
package com.twitter.scalding.examples

import com.twitter.scalding._

// Reads lines of text, splits each line into words, counts
// occurrences per word, and writes (word, count) pairs as TSV.
class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") ).read.
    flatMap('line -> 'word) { line : String => line.split("\\s+") }.
    groupBy('word) { _.size }.
    write( Tsv( args("output") ) )
}
```
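To see how closely the DSL tracks Scala's collection API, here is the same word count written with plain in-memory collections. This is only an illustrative analogue, not Scalding code; `LocalWordCount` is a hypothetical name used for the sketch.

```scala
// Plain Scala collections analogue of the WordCountJob above.
// Illustrates the API similarity only; it does not use Scalding or Hadoop.
object LocalWordCount {
  def countWords(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))  // like flatMap('line -> 'word)
      .filter(_.nonEmpty)                   // drop empty tokens
      .groupBy(identity)                    // like groupBy('word)
      .map { case (word, occurrences) => (word, occurrences.size) } // like { _.size }

  def main(args: Array[String]): Unit =
    println(countWords(List("hello world", "hello scalding")))
}
```

The `flatMap`/`groupBy` pipeline is the same shape in both versions; the Scalding job just names the intermediate fields (`'line`, `'word`) instead of binding Scala values.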
## Tutorial
See tutorial/ for examples of how to use the DSL. See tutorial/CodeSnippets.md for some
example scalding snippets.

## Building
0. Install sbt 0.11
1. ```sbt update``` (takes 2 minutes or more)
2. ```sbt test```
3. ```sbt assembly``` (needed to make the jar used by the scald.rb script)

## Comparison to Scrunch/Scoobi
Scalding comes with an executable tutorial set that does not require a Hadoop
cluster. If you're curious about scalding, why not invest a bit of time, run the tutorial
yourself, and form your own judgment?

Scalding was developed before either of those projects
was announced publicly and has been used in production at Twitter for more than six months
(though it has been through a few iterations internally).
The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has
a record model where each element in your distributed list/table is a tuple with some named
fields. This is nice because the most common case is to have a few primitive columns (ints, strings,
etc.). This is discussed in detail in the two answers to the following question:
<http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading>

Scoobi and Scrunch stress types and do not
use field names to build ad-hoc record types. Cascading's fields are very convenient,
and our users have been very productive with Scalding. Fields do present problems for
type inference, because Cascading cannot tell you the type of the data in Fields("user_id", "clicks")
at compile time. This could be surmounted by building a record system in Scala that
allows the programmer to express the types of the fields, but the cost of this is not trivial,
and the win is not so clear.
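The type-inference problem can be sketched in plain Scala (this is not Cascading's actual API; `NamedRecord` and `UserClicks` are hypothetical names for illustration). A record keyed by field names can only hand back `Any`, so the caller supplies the type and it is checked at runtime, whereas a case class gives compile-time typing:

```scala
// A plain-Scala sketch of why named fields defeat type inference:
// a field-name-keyed record can only offer Any, so the caller must
// assert the type, and the assertion is checked only at runtime.
case class NamedRecord(fields: Map[String, Any]) {
  def get[T](name: String): T = fields(name).asInstanceOf[T]
}

object FieldsVsTypes {
  // The typed alternative: field types are known to the compiler.
  case class UserClicks(userId: Long, clicks: Int)

  def main(args: Array[String]): Unit = {
    val rec = NamedRecord(Map("user_id" -> 42L, "clicks" -> 7))
    val clicks: Int = rec.get[Int]("clicks") // type supplied by the caller, not inferred
    val typed = UserClicks(42L, 7)
    println(clicks + typed.clicks)
  }
}
```

A record system as described in the paragraph above would essentially recover the `UserClicks` side of this sketch while keeping field names.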

Scalding supports using any Scala object in your map/reduce operations using Kryo serialization,
including Scala Lists, Sets, Maps, Tuples, etc. It is not clear that such transparent serialization
is present yet in Scrunch. Like Scoobi, Scalding has a form of MSCR fusion by relying on Cascading's
AggregateBy operations. Our reduce primitives (see GroupBuilder.reduce and .mapReduceMap) are
comparable to Scoobi's combine primitive, which by default uses Hadoop combiners on the map side.
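The reason a reduce can be pushed onto map-side combiners is that an associative, commutative operation gives the same answer whether it runs once globally or first per partition and then on the partial results. A local sketch of that property (plain Scala, not Scalding's API; `CombinerSketch` is a hypothetical name):

```scala
// Why an associative, commutative reduce can run on Hadoop combiners:
// reducing each partition first and then merging the partial results
// equals a single global reduce over all the data.
object CombinerSketch {
  def globalSum(xs: List[Int]): Int = xs.reduce(_ + _)

  // "Map-side" partial reduction per partition, then a "reduce-side" merge.
  def combinedSum(partitions: List[List[Int]]): Int =
    partitions.map(_.reduce(_ + _)).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val data = List(List(1, 2, 3), List(4, 5), List(6))
    println(globalSum(data.flatten) == combinedSum(data)) // prints true
  }
}
```

Operations like GroupBuilder.reduce require exactly this shape of function, which is what lets Cascading's AggregateBy apply them before the shuffle.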
Lastly, Scalding comes with a script that allows you to write a single file and run that
single file locally or on your Hadoop cluster by typing one line: "scald.rb [--local] myJob.scala".
It is really convenient to use the same language/tool to run jobs on Hadoop and then to post-process
the output locally.

## Mailing list

Currently we are using the cascading-user mailing list for discussions:
<http://groups.google.com/group/cascading-user>

Follow the scalding project on twitter for updates: <http://twitter.com/scalding>

## Bugs
In the remote possibility that there exist bugs in this code, please report them to:
<https://github.com/twitter/scalding/issues>

## Authors:
* Avi Bryant <http://twitter.com/avibryant>
* Oscar Boykin <http://twitter.com/posco>
* Argyris Zymnis <http://twitter.com/argyris>

Thanks for assistance and contributions:

* Chris Wensel <http://twitter.com/cwensel>
* Ning Liang <http://twitter.com/ningliang>
* Dmitriy Ryaboy <http://twitter.com/squarecog>
* Dong Wang <http://twitter.com/dongwang218>
* Edwin Chen <http://twitter.com/edchedch>

## License
Copyright 2012 Twitter, Inc.

Licensed under the Apache License, Version 2.0: <http://www.apache.org/licenses/LICENSE-2.0>