
There are two main concepts in the type-safe API: a TypedPipe[T], which is a kind of distributed list of objects of type T, and a KeyedList[K,V], which represents some sharding of values of type V by keys of type K. There are two kinds of KeyedList: Grouped[K,V] and subclasses of CoGrouped. The former represents ordinary groupings, and the latter is used for cogroupings or joins.

Generally, all the methods from the Fields-based API Reference are present with the following exceptions:
1. The mapping functions always replace the input with the output. map and flatMap in the type-safe API are similar to the mapTo and flatMapTo functions (respectively) in the Fields-based API.
2. Due to the previous statement, there is no need to name fields.

Interoperating between Fields API and Type-safe API

If you import TDsl._ you get an enrichment on cascading Pipe objects to jump into a Typed block:

  pipe.typed(('in0, 'in1) -> 'out) { tpipe : TypedPipe[(Int,Int)] =>
     tpipe.groupBy { x => 1 } //groups on all the input tuples (equivalent to groupAll)
     .mapValues { tup => tup._1 + tup._2 } //sum the two values in each tuple
     .sum //sum all the tuple sums (i.e. sum everything)
     .values // discard the key which is 1
  }

In this example, we start off with a cascading Pipe (pipe), which has the 'in0 and 'in1 fields. We use the method typed in order to create a new TypedPipe (tpipe). Then, we apply all of our functions on the TypedPipe[(Int, Int)] to obtain a TypedPipe[Int] which has the total sum. Finally, this is converted back into the cascading Pipe (pipe) with the single field 'out, which contains a single Tuple holding the total sum.

Converting pipes

  • To go from a pipe to a TypedPipe[T]: mypipe.toTypedPipe[T](Fields_Kept). Fields_Kept specifies the fields in mypipe that we want to keep in the Typed Pipe.
  • To go from a TypedPipe[T] to a pipe: myTypedPipe.toPipe(f: Fields). Since we go from a typed pipe to a Cascading pipe, we need to give names to the fields.

Example:

import TDsl._

case class Bird(name : String, winLb : Float, color : String)
val birds : TypedPipe[Bird] = getBirdPipe
val fieldsPipe = birds.toPipe('name, 'winLb, 'color) //Cascading Pipe with the 3 specified fields.
fieldsPipe.toTypedPipe[(String, String)]('name, 'color) //Typed Pipe (keeping only some fields)

Advanced examples:

import TDsl._

case class Bird(name : String, winLb : Float, hinFt : Float, color : String)
val birds : TypedPipe[Bird] = getBirdPipe
birds.toPipe('name, 'color)

val p : TypedPipe[(Double, Double)] = TypedTsv[(Double,Double)](input, ('a, 'b)).toTypedPipe[(Double, Double)]('a, 'b)

TypedPipe[MyClass] is slightly more involved, but you can get it in several ways. One straightforward way is:

object Bird {
  def fromTuple(t : (Double, Double)) : Bird = Bird(t._1, t._2)
}

case class Bird(weight : Double, height : Double) {
  def toTuple : (Double, Double) = { (weight, height) }
}

import TDsl._
val birds : TypedPipe[Bird] = TypedTsv[(Double, Double)](path, ('weight, 'height)).map{ Bird.fromTuple(_) }

Map-like functions

map

def map[U](f : T => U) : TypedPipe[U]

Converts a TypedPipe[T] to a TypedPipe[U] via f : T => U

case class Bird(name : String, winLb : Float, hinFt : Float, color : String)
val birds : TypedPipe[Bird] = getBirdPipe
val britishBirds : TypedPipe[(Float, Float)] =
  birds.map { bird =>
    val (weightInLbs, heightInFt) = (bird.winLb, bird.hinFt)
    (0.454f * weightInLbs, 0.305f * heightInFt) //Float literals keep the result a (Float, Float)
  }

flatMap

def flatMap[U](f : T => Iterable[U]) : TypedPipe[U]

Converts a TypedPipe[T] to a TypedPipe[U] by applying f : T => Iterable[U] followed by flattening.

case class Book(title : String, author : String, text : String)
val books : TypedPipe[Book] = getBooks
val words : TypedPipe[String] = books.flatMap { _.text.split("\\s+") }
//Example that uses Option. (either an animal name passes by in the pipe or nothing)
animals.flatMap { an =>
  if (an.kind == "bird") {Some(an.name)} else None
}
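For intuition, the same flatMap behaviour can be tried on an ordinary Scala List. This is only a local sketch: the List stands in for a TypedPipe, nothing here calls Scalding, and the Creature class and sample data are hypothetical.

```scala
// Local sketch: a List stands in for a TypedPipe (assumption for illustration).
case class Creature(name: String, kind: String)

val creatures = List(
  Creature("robin", "bird"),
  Creature("trout", "fish"),
  Creature("wren", "bird"))

// Each element yields zero or one output; flatMap drops the None results.
val birdNames: List[String] = creatures.flatMap { c =>
  if (c.kind == "bird") Some(c.name) else None
}

// Each element yields several outputs; flatMap concatenates them.
val splitWords: List[String] =
  List("a b", "c  d e").flatMap { _.split("\\s+").toList }
```

Returning an Option from the function is a convenient way to express "map and maybe drop" in a single step.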

filter, filterNot

Filters out rows.

filter

def filter(f : T => Boolean) : TypedPipe[T]

If you return true you keep the row, otherwise the row is ignored.

case class Animal(name : String, kind: String)
val animals : TypedPipe[Animal] = getAnimals
val birds = animals.filter { _.kind == "bird" }

filterNot

def filterNot(f : T => Boolean) : TypedPipe[T]

Acts like filter with a negated predicate - keeps the rows where the predicate function returns false, otherwise the row is ignored.

case class Animal(name : String, kind: String)
val animals : TypedPipe[Animal] = getAnimals
val notBirds = animals.filterNot { _.kind == "bird" }
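As a quick local check of how filter and filterNot relate, the sketch below uses a plain Scala List in place of a TypedPipe. The Pet class and its data are hypothetical, and no Scalding code is involved.

```scala
// Local sketch: a List stands in for a TypedPipe (assumption for illustration).
case class Pet(name: String, kind: String)

val pets = List(Pet("robin", "bird"), Pet("trout", "fish"), Pet("wren", "bird"))

// filter keeps rows where the predicate is true.
val keptBirds = pets.filter { _.kind == "bird" }

// filterNot keeps rows where the same predicate is false.
val keptOthers = pets.filterNot { _.kind == "bird" }
```

With the same predicate, the two results together contain every input row exactly once.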

collect

def collect[U](f : PartialFunction[T,U]) : TypedPipe[U]

Filters and maps with Scala's partial function syntax (case):

case class Animal(name : String, kind: String)
val animals : TypedPipe[Animal] = getAnimals
val birds: TypedPipe[String] = animals.collect { case Animal(name, "bird") => name }
//This is the same as flatMapping an Option.
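The equivalence between collect and flatMapping an Option can be checked locally on a Scala List, used here only as a stand-in for a TypedPipe. The Beast class and sample rows are hypothetical.

```scala
// Local sketch: a List stands in for a TypedPipe (assumption for illustration).
case class Beast(name: String, kind: String)

val beasts = List(Beast("robin", "bird"), Beast("trout", "fish"))

// collect: rows that match the case are mapped, the rest are dropped.
val viaCollect: List[String] = beasts.collect { case Beast(name, "bird") => name }

// The same result written as a flatMap over an Option.
val viaFlatMap: List[String] = beasts.flatMap { b =>
  if (b.kind == "bird") Some(b.name) else None
}
```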

Creating Groups and Joining (CoGrouping)

Grouping

These are all methods on TypedPipe[T]. Notice that these methods do not return a TypedPipe[T] anymore; instead, they return Grouped[K,T].

groupBy

def groupBy[K](g : (T => K))(implicit  ord : Ordering[K]) : Grouped[K,T]

Call g : T => K on a TypedPipe[T] to create a Grouped[K,T]. Subsequent aggregation methods use K as the type of the grouping key. We can use any of the group functions described in the Fields-based API Reference to transform the Grouped[K,T] to a TypedPipe[U]. Notice that those functions act on T.

Groups need an Ordering (i.e. a comparator) for the key K that we are grouping by. This is implemented for all the standard variable types that we use, in which case no explicit declaration is necessary.

case class Book(title : String, author : String, year : Int)
val books : TypedPipe[Book] = getBooks
val fields : TypedPipe[(String, String, Int)] = books.map { b : Book => (b.title, b.author, b.year) }
//We want to group all the books based on their author
val byAuthor = fields.groupBy{ book_info : (String, String, Int) => book_info._2 }
//Now, we have Grouped[String*, (String, String, Int)**].
//*This String corresponds to the author.
//**This Tuple corresponds to the original fields found on the TypedPipe named fields.
byAuthor.size
//This yields (String, Long) pairs, where the String corresponds to the author and the Long corresponds to the number of books that the author wrote.
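To see what grouping by author and counting amounts to, here is a local sketch on Scala collections. A Map of Lists stands in for the grouping; this is an assumption for illustration and not how Scalding stores a Grouped. The sample rows are hypothetical.

```scala
// Local sketch: collections stand in for TypedPipe and Grouped.
val bookInfo: List[(String, String, Int)] = List(
  ("Emma", "Austen", 1815),
  ("Persuasion", "Austen", 1817),
  ("Dracula", "Stoker", 1897))

// Group on the author (the second element), as in groupBy { _._2 }.
val booksByAuthor: Map[String, List[(String, String, Int)]] =
  bookInfo.groupBy { _._2 }

// Count the rows under each key, which is what size reports per author.
val booksPerAuthor: Map[String, Long] =
  booksByAuthor.map { case (author, rows) => (author, rows.size.toLong) }
```

Note that the grouped values are still the full original tuples; only the key is extracted by the grouping function.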

group (implicit grouping)

def group[K,V](implicit ev : =:=[T,(K,V)], ord : Ordering[K]) : Grouped[K,V]

Special case of groupBy that can be called on TypedPipe[(K,V)]. Uses K as the grouping key.

groupAll (send everything to one reducer)

def groupAll : Grouped[Unit,T]

Uses Unit as the grouping key. Useful to send all tuples to 1 reducer.

Useful functions on Grouped[K,V].

val group: Grouped[K, V]
group.keys 
//Creates a TypedPipe[K] consisting of the keys in the (key, value) pairs of group.
group.values 
//Creates a TypedPipe[V] consisting of the values in the (key, value) pairs of group.
group.mapValues{values:V => mappingFunction(values)}
//Creates a Grouped[K, V'], where the keys in the (key, value) pairs of group are unchanged, but the values are changed to V'.
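The three operations can be pictured on a local list of (key, value) pairs. This is a sketch under the assumption that a List of pairs is an adequate stand-in for a Grouped[K,V]; the data is hypothetical and uses one value per key to keep the picture simple.

```scala
// Local sketch: a List of pairs stands in for Grouped[K, V].
val keyedPairs: List[(String, Int)] = List(("a", 1), ("b", 2), ("c", 3))

// keys: keep only the key of each pair.
val onlyKeys: List[String] = keyedPairs.map(_._1)

// values: keep only the value of each pair.
val onlyValues: List[Int] = keyedPairs.map(_._2)

// mapValues: transform the value, leave the key untouched.
val doubledValues: List[(String, Int)] =
  keyedPairs.map { case (k, v) => (k, v * 2) }
```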

Joining/CoGrouping

General information about Joins can be found at the Fields API documentation, in the Joins section.

These are all methods on Grouped[K,V]. Always put the smaller grouping on the right; this improves performance and does not affect correctness. First, we group the pipe by a key of type K to get a Grouped[K,V]. Then, we join with another group on the same key type K, for example a Grouped[K,W].

join (inner-join)

def join[W](smaller : Grouped[K,W]) : CoGrouped2[K,V,W,(V,W)]

We already know K and V. The only type that could be specified in the join function is W, which is the value in the key-valued group of the smaller group.

//We have two libraries and we want to get a list of the books they have in common.
//The books of Library 2 have an additional field "copies."
case class Book(title : String, year : Int)
case class Extended_Book(title : String, year : Int, copies : Long)
//Group the books of Library 1 by book title.
val library1 : TypedPipe[Book] = getBooks1
val L1 : TypedPipe[(String, Int)] = library1.map { b : Book => (b.title, b.year) }
val group1 = L1.groupBy{ book_info : (String, Int) => book_info._1 }
//Similarly, group the books of Library 2 by book title.
val library2 : TypedPipe[Extended_Book] = getBooks2
val L2 : TypedPipe[(String, Int, Long)] = library2.map { b : Extended_Book => (b.title, b.year, b.copies) }
val group2 = L2.groupBy{ book_info : (String, Int, Long) => book_info._1 }

//If Library 1 is larger than Library 2, then we do:
val TheJoin : CoGrouped2[String, (String, Int), (String, Int, Long), ((String, Int), (String, Int, Long))] = group1.join(group2)

leftJoin

def leftJoin[W](smaller : Grouped[K,W]) : CoGrouped2[K,V,W,(V,Option[W])]

Using the definitions from the previous example, assume you are the general manager of Library 1 and you are interested in a complete list of all the books in your library. In addition, you would like to know, which of those books can also be found in Library 2, in case the ones in your library are being used:

val TheLeftJoin : CoGrouped2[String, (String, Int), (String, Int, Long), ((String, Int), Option[(String, Int, Long)])] = group1.leftJoin(group2)

rightJoin

def rightJoin[W](smaller : Grouped[K,W]) : CoGrouped2[K,V,W,(Option[V],W)]

outerJoin

def outerJoin[W](smaller : Grouped[K,W]) : CoGrouped2[K,V,W,(Option[V],Option[W])]

The most useful function of CoGrouped2 is toTypedPipe.

val myJoin : CoGrouped2[K,V,W,(V,W)]
val tpipe : TypedPipe[(K,(V,W))] = myJoin.toTypedPipe
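The result types of the four joins differ only in where Option appears. The local sketch below models that with two Scala Maps keyed by title. It assumes exactly one value per key on each side and hypothetical data, and it does not use Scalding itself.

```scala
// Local sketch: Maps stand in for two groupings on the same key type.
val leftSide: Map[String, Int] = Map("Emma" -> 1815, "Dracula" -> 1897)
val rightSide: Map[String, Long] = Map("Emma" -> 3L, "Ulysses" -> 1L)

// join: only keys present on both sides, values are (V, W).
val innerJoined: Map[String, (Int, Long)] =
  leftSide.keySet.intersect(rightSide.keySet)
    .map { k => (k, (leftSide(k), rightSide(k))) }.toMap

// leftJoin: every left key, values are (V, Option[W]).
val leftJoined: Map[String, (Int, Option[Long])] =
  leftSide.map { case (k, v) => (k, (v, rightSide.get(k))) }

// rightJoin: every right key, values are (Option[V], W).
val rightJoined: Map[String, (Option[Int], Long)] =
  rightSide.map { case (k, w) => (k, (leftSide.get(k), w)) }

// outerJoin: every key from either side, values are (Option[V], Option[W]).
val outerJoined: Map[String, (Option[Int], Option[Long])] =
  (leftSide.keySet ++ rightSide.keySet)
    .map { k => (k, (leftSide.get(k), rightSide.get(k))) }.toMap
```

A None marks the side on which the key was missing, which is why only the inner join has no Option in its value type.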

Map-side (replicated) joins:

These methods do not require a reduce step, but should only be used on extremely small arguments since each mapper will read the entire argument to do the join.

cross

Suppose we want to pair every value from one TypedPipe[U] with each value of a TypedPipe[T]. List(1,2,3) cross List(4,5) gives List((1,4),(1,5),(2,4),(2,5),(3,4),(3,5)). The final size is left.size * right.size.

// Implements a cross product.  The right side should be tiny.
def cross[U](tiny : TypedPipe[U]) : TypedPipe[(T,U)]
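The cross product from the text can be reproduced locally with a for comprehension over two Lists. This is a sketch on plain collections, offered as an analogy rather than as Scalding code; the value names are hypothetical.

```scala
// Local sketch: Lists stand in for the large pipe and the tiny pipe.
val largeSide = List(1, 2, 3)
val tinySide = List(4, 5)

// Pair every element of the large side with every element of the tiny side.
val crossed: List[(Int, Int)] =
  for (l <- largeSide; t <- tinySide) yield (l, t)
```

Because the output size is the product of the two input sizes, keeping the right side tiny matters for both memory and output volume.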

hashJoin

A very efficient join, which works when the right side is tiny, is hashJoin. All the (key, value) pairs from the right side are stored in a hash table for quick retrieval. The hash table is replicated on every mapper and the hashJoin operation takes place entirely on the mappers (no reducers involved).

// Again, the right side should be tiny.
def hashJoin[W](tiny : Grouped[K,W]) : TypedPipe[(K,(V,W))]
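The idea behind a hash join can be sketched locally: the tiny side becomes an in-memory Map, and each record of the large side does a lookup. The sketch assumes one value per key on the tiny side and inner-join semantics, and all names and data are hypothetical.

```scala
// Local sketch: a List is the large side, a Map is the replicated hash table.
val largeRecords: List[(String, Int)] =
  List(("us", 1), ("fr", 2), ("us", 3), ("zz", 4))

val tinyTable: Map[String, String] =
  Map("us" -> "United States", "fr" -> "France")

// Each large-side record looks up its key; records without a match are dropped.
val hashJoined: List[(String, (Int, String))] =
  largeRecords.flatMap { case (k, v) =>
    tinyTable.get(k).map { w => (k, (v, w)) }.toList
  }
```

No shuffling or sorting of the large side is needed, since every lookup is local to the record being processed.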

ValuePipe: working with values that will be computed

Sometimes we reduce everything down to one value:

val userFollowers: TypedPipe[(Long, Int)] = // function to get

val topUserIds: TypedPipe[Long] = userFollowers
  .collect { case (uid, followers) if followers > 1000000 => uid }

// put it in a value:
val topUsers: ValuePipe[Set[Long]] = topUserIds.map(Set(_)).sum

A ValuePipe is a kind of future value: it is a value that will be computed by your job, but is not there yet. TypedPipe.sum returns a ValuePipe.

When you have this, you can then use it on another TypedPipe:

val allClickers: TypedPipe[Long] = //...

val topClickers = allClickers.filterWithValue(topUsers) { (clicker, optSet) =>
  optSet.exists(_.contains(clicker)) // keep the clickers that are also top users; if the value is empty, keep nothing
}

You can also mapWithValue or flatMapWithValue. See ValuePipe.scala for more.
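As a local analogy, the "single value used to filter another collection" pattern looks like this on plain Scala collections. An Option[Set[Long]] stands in for the ValuePipe, since the value may be absent; this is a sketch with hypothetical names and data, not Scalding code.

```scala
// Local sketch: Lists stand in for TypedPipes, an Option for the ValuePipe.
val followerCounts: List[(Long, Int)] =
  List((1L, 2000000), (2L, 10), (3L, 5000000))

val bigAccounts: List[Long] =
  followerCounts.collect { case (uid, n) if n > 1000000 => uid }

// Summing singleton sets is set union; an empty input leaves no value at all.
val bigAccountSet: Option[Set[Long]] =
  bigAccounts.map(Set(_)).reduceOption(_ ++ _)

val clickers: List[Long] = List(3L, 2L, 1L, 7L)

// Keep a clicker only when the set exists and contains it.
val bigClickers: List[Long] =
  clickers.filter { c => bigAccountSet.exists(_.contains(c)) }
```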

Records

Suppose you have many fields and you want to update just one or two. Did you know about the copy method on all case classes?

Consider this example:

scala> case class Record(name: String, weight: Double)
defined class Record

scala> List(Record("Bob", 180.3), Record("Lisa", 154.3))
res22: List[Record] = List(Record(Bob,180.3), Record(Lisa,154.3))

scala> List(Record("Bob", 180.3), Record("Lisa", 154.3)).map { r =>
  val w = r.weight + 10.0
  r.copy(weight = w)
}
res23: List[Record] = List(Record(Bob,190.3), Record(Lisa,164.3))

In exactly the same way, you can update just one or two fields of a case class in Scalding with the typed API.

This is how we recommend making records, but WATCH OUT: you need to define case classes OUTSIDE of your job due to serialization reasons (otherwise they create circular references).

Aggregation and Stream Processing

Both Grouped[K,R] and CoGrouped[K,V,W,R] extend KeyedList[K,R], which is the class that represents sublists of R sharded by K and on which you can do a reduce or entire stream processing via mapValueStream. If you can, prefer reduce or methods implemented via reduce, which can in general be partially performed on the mappers before passing to the reducers. Using mapValueStream always forces all the data to the reducers. Realizing the entire stream of values at once (e.g. manually reversing or rescanning the data) can exhaust memory.
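Why a reduce can be partially performed on the mappers is easiest to see with an associative operation such as addition. The local sketch below splits the values for one key into two hypothetical "mapper" shares, reduces each share, and then merges the partial results; it is an illustration on plain Lists, not Scalding code.

```scala
// Local sketch: the values for one key, spread over two pretend mappers.
val keyValues = List(3, 1, 4, 1, 5, 9, 2, 6)
val (mapperA, mapperB) = keyValues.splitAt(4)

// Each mapper reduces its own share first.
val partials: List[Int] = List(mapperA.reduce(_ + _), mapperB.reduce(_ + _))

// The reducer only has to merge one partial result per mapper.
val combined: Int = partials.reduce(_ + _)

// Reducing everything in one place gives the same answer.
val allAtOnce: Int = keyValues.reduce(_ + _)
```

The two results agree because addition is associative; a function that needs to see the whole stream in order offers no such shortcut.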

The Fields-based API Reference has a builder-pattern object called GroupBuilder which allows you to easily create a tuple of several parallel aggregations, e.g. counting, summing, and taking the max, all in one pass through the data. The type-safe API has a way to do this, but it involves implementing a type-class for your object and using KeyedList.sum on the tuple. Below we give an example.

case class MyObject(name : String, ridesFixie : Boolean, rimmedGlasses : Boolean, income : Double) {
  def hipsterScore = { List(ridesFixie, rimmedGlasses).map { if(_) 1.0 else 0.0 }.sum + 1.0/income }
}

// Monoid which chooses the highest hipster score
implicit val hipsterMonoid = new Monoid[MyObject] {
  def zero = MyObject("zeroHipster", false, false, Double.PositiveInfinity)
  def plus(left : MyObject, right : MyObject) = { List(left, right).maxBy { _.hipsterScore } }
}

// Now let's count our fixie riders and find the biggest hipster
val everybody : TypedPipe[MyObject] = getPeople
// Now we want to know how many total people, fixie riders, and how many rimmed-glasses wearers,
// as well as the biggest hipster:
everybody.map { person => (1L, if(person.ridesFixie) 1L else 0L, if(person.rimmedGlasses) 1L else 0L, person) }
  .groupAll
  .sum
  //Throw away the unit key created by groupAll
  .values
  // annoying, but mapping a single part in the tuple is a pain (here we want .name)
  .map { results => (results._1, results._2, results._3, results._4.name) }
  .write(TypedTsv[(Long, Long, Long, String)]("maxHipster"))

Scalding automatically knows how to sum tuples (it does so element-wise, see GeneratedAbstractAlgebra.scala).
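What "element-wise" means can be sketched with a small hand-written type class. SimpleMonoid below is a hypothetical local stand-in defined only for this illustration; it is not Algebird's Monoid, and the pair instance is built by hand rather than generated.

```scala
// Hypothetical stand-in for a monoid type class, defined locally.
trait SimpleMonoid[T] {
  def zero: T
  def plus(left: T, right: T): T
}

// Ordinary addition on Long, used for counting.
val longSum: SimpleMonoid[Long] = new SimpleMonoid[Long] {
  def zero: Long = 0L
  def plus(left: Long, right: Long): Long = left + right
}

// Keeps the larger Double, similar in spirit to a "pick the max" monoid.
val doubleMax: SimpleMonoid[Double] = new SimpleMonoid[Double] {
  def zero: Double = Double.NegativeInfinity
  def plus(left: Double, right: Double): Double = math.max(left, right)
}

// A pair is summed element-wise, each component with its own monoid.
def pairMonoid[A, B](ma: SimpleMonoid[A], mb: SimpleMonoid[B]): SimpleMonoid[(A, B)] =
  new SimpleMonoid[(A, B)] {
    def zero: (A, B) = (ma.zero, mb.zero)
    def plus(left: (A, B), right: (A, B)): (A, B) =
      (ma.plus(left._1, right._1), mb.plus(left._2, right._2))
  }

val countAndMax = pairMonoid(longSum, doubleMax)
val scoreRows: List[(Long, Double)] = List((1L, 0.5), (1L, 2.5), (1L, 1.0))

// One pass produces both the count and the maximum.
val countAndBest: (Long, Double) =
  scoreRows.foldLeft(countAndMax.zero)(countAndMax.plus)
```

Because each tuple position is combined independently, several aggregations can share a single pass over the data.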

Powerful Aggregation with Algebird

See this example on Locality Sensitive Hashing via @argyris.
