Rosetta Code

A collection of MapReduce tasks translated (from Pig, Hive, MapReduce streaming, etc.) into Scalding. For fully runnable code, see the repository here.

Word Count

Cascalog

(defmapcatop tokenize [text]
  (seq (.split text "\\s+")))

(defn word-count [input-filename]
  (let [input (hfs-textline input-filename)]
    (<- [?word ?count]
        (input ?textline)
        (tokenize ?textline :> ?word)
        (c/count ?count))))

(?- (stdout) (word-count "tweets.tsv"))

Hive

# tokenizer.py
import sys

for line in sys.stdin:
  for word in line.split():
    print word

CREATE TABLE tweets (text STRING);
LOAD DATA LOCAL INPATH 'tweets.tsv' OVERWRITE INTO TABLE tweets;

SELECT
  word, COUNT(*) AS count
FROM (
  SELECT TRANSFORM(text)
    USING 'python tokenizer.py'
    AS word
  FROM tweets
) t
GROUP BY word;

Pig

tweets = LOAD 'tweets.tsv' AS (text:chararray);
words = FOREACH tweets GENERATE FLATTEN(TOKENIZE(text)) AS word;
word_groups = GROUP words BY word;
word_counts = FOREACH word_groups GENERATE group AS word, COUNT(words) AS count;

STORE word_counts INTO 'word_counts.tsv';

Scalding

Tsv("tweets.tsv", 'text)
  .flatMap('text -> 'word) { text : String => text.split("\\s+") }
  .groupBy('word) { _.size }
  .write(Tsv("word_counts.tsv"))

Distributed Grep

The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. (Example taken from Google Code University.)

Pig

%declare PATTERN '.*hello.*';

tweets = LOAD 'tweets.tsv' AS (text:chararray);
results = FILTER tweets BY (text MATCHES '$PATTERN');

Scalding

val Pattern = ".*hello.*";

Tsv("tweets.tsv", 'text)
  .filter('text) { text : String => text.matches(Pattern) }
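
As in the Pig version, the pattern can be parameterized rather than hard-coded. A sketch of a runnable job that reads the pattern from the command line (the --pattern argument and the results.tsv sink are illustrative):

class GrepJob(args: Args) extends Job(args) {
  // e.g. run with: --pattern '.*hello.*'
  val pattern = args("pattern")

  // Keep only the lines whose text matches the pattern.
  Tsv("tweets.tsv", 'text)
    .filter('text) { text: String => text.matches(pattern) }
    .write(Tsv("results.tsv"))
}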

Inverted Index

The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions. (Example taken from Google Code University.)

Pig

tweets = LOAD 'tweets.tsv' AS (tweet_id:int, text:chararray);

words = FOREACH tweets GENERATE tweet_id, FLATTEN(TOKENIZE(text)) AS word;
word_groups = GROUP words BY word;
inverted_index = FOREACH word_groups GENERATE group AS word, words.tweet_id;

Scalding

val tweets = Tsv("tweets.tsv", ('id, 'text))

val wordToTweets =
  tweets
    .flatMap(('id, 'text) -> ('word, 'tweetId)) { fields: (Long, String) =>
      val (tweetId, text) = fields
      text.split("\\s+").map { word => (word, tweetId) }
    }

val invertedIndex =
  wordToTweets.groupBy('word) { _.toList[Long]('tweetId -> 'tweetIds) }
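
The description above asks for the document IDs within each word's group to be sorted, which toList by itself does not guarantee. A sketch of a variant that sorts within each group before collecting and writes the index out (assuming GroupBuilder's sortBy composes with toList; the sink name is illustrative):

wordToTweets
  // Sort each word's tweet IDs before collecting them into a list.
  .groupBy('word) { _.sortBy('tweetId).toList[Long]('tweetId -> 'tweetIds) }
  .write(Tsv("inverted_index.tsv"))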
