This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Optimizations on the Producer Graph

Ian O'Connell edited this page Feb 27, 2014 · 4 revisions

Existing optimizations:

  1. optionMap pushed up to sources.
  2. flatMap/optionMap composition
  3. map-side caching and combining in flatMappers immediately before sumByKey
  4. batching of results from flatMaps prior to sumByKey, sending one Map[K, V] rather than many individual (K, V) pairs (this reduces thread contention in Storm).
  5. flatMapKeys followed by sumByKey triggers a partial sum of the values before the flatMapKeys step, where the keys should have better locality (Scalding only right now).
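
Items 3 and 4 above can be illustrated with a small model. This is a minimal sketch in plain Python, not Summingbird code: `flat_map_then_batch`, `sum_by_key`, `word_counts`, and the `batch_size` parameter are hypothetical names, and integer addition stands in for whatever associative sum the job actually uses. The flatMapper sums its own output per key and emits one dict per batch instead of one message per pair.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, Tuple

Pair = Tuple[Hashable, int]


def flat_map_then_batch(
    events: Iterable[str],
    fn: Callable[[str], Iterable[Pair]],
    batch_size: int,
) -> Iterator[Dict[Hashable, int]]:
    """Apply fn to each event and pre-sum its output per key.

    Yields one dict per `batch_size` input events, standing in for the
    Map[K, V] sent downstream instead of many individual (K, V) pairs.
    """
    cache: Dict[Hashable, int] = defaultdict(int)
    seen = 0
    for event in events:
        for key, value in fn(event):
            cache[key] += value  # combine locally before anything is sent
        seen += 1
        if seen == batch_size:
            yield dict(cache)
            cache.clear()
            seen = 0
    if cache:
        yield dict(cache)  # flush the partial final batch


def sum_by_key(batches: Iterable[Dict[Hashable, int]]) -> Dict[Hashable, int]:
    """Merge the pre-summed batches into final per-key totals."""
    totals: Dict[Hashable, int] = defaultdict(int)
    for batch in batches:
        for key, value in batch.items():
            totals[key] += value
    return dict(totals)


def word_counts(line: str) -> Iterable[Pair]:
    """Hypothetical flatMap function: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


lines = ["a b a", "b c", "a"]
batches = list(flat_map_then_batch(lines, word_counts, batch_size=2))
totals = sum_by_key(batches)
```

Here the three input lines produce six (K, V) pairs, but only two messages reach the summing step. The totals are unchanged because the sum is associative, which is the property this kind of pre-aggregation relies on.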

Possible future optimizations:

  1. Diamond collapse: (a.flatMap(fn1) ++ a.flatMap(fn2)) could be collapsed into a single flatMap (similarly with optionMap).
  2. flatMapKeys partial sum in online mode.
  3. leftJoin -> mapValues -> sumByKey could be pushed to the summer so that keys are colocated before hitting the service. In Scalding, this is almost certainly a win; online, when local caches are used, it may give better locality.
  4. push flatMap/leftJoin up to sources
  5. Online: if a store is a converted store, go ahead and do the key Injection on the flatMappers before the sumByKey (i.e. we are often converting an object to bytes before putting it into memcache, so why not serialize once and do the cheap bytes-only serialization between the flatMappers and reducers).
  6. Offline: Frozen keys. We have some code/PRs for it, but it needs to be finished and tested.
  7. Online: Full batching between Spouts, FlatMaps and Summers
  8. Online: Collapse the final flatMaps into spouts when we have enough parallelism
  9. Online: Automatic tuning that moves operations in or out of a spout
  10. Online: Detect that we have a CAS store and collapse the whole tree into the spouts/sources
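
The diamond collapse in item 1 of this list can be checked on a small model. The following is a sketch in plain Python rather than Summingbird code; `flat_map`, `diamond`, `collapsed`, `halves`, and `neighbours` are hypothetical names chosen for illustration. It compares the two-branch form `a.flatMap(fn1) ++ a.flatMap(fn2)` with a single flatMap that applies both functions to each item.

```python
from itertools import chain
from typing import Callable, Iterable, List


def flat_map(source: Iterable[int], fn: Callable[[int], Iterable[int]]) -> List[int]:
    """Apply fn to every item and concatenate the results."""
    return [out for item in source for out in fn(item)]


def diamond(source: List[int], fn1, fn2) -> List[int]:
    """Uncollapsed form: two passes over the source, then a merge (++)."""
    return flat_map(source, fn1) + flat_map(source, fn2)


def collapsed(source: List[int], fn1, fn2) -> List[int]:
    """Collapsed form: one pass that applies both functions per item."""
    return flat_map(source, lambda item: chain(fn1(item), fn2(item)))


def halves(n: int) -> List[int]:
    """Hypothetical fn1: keep only even numbers, halved."""
    return [n // 2] if n % 2 == 0 else []


def neighbours(n: int) -> List[int]:
    """Hypothetical fn2: emit the two adjacent integers."""
    return [n - 1, n + 1]


source = [1, 2, 3]
two_pass = diamond(source, halves, neighbours)
one_pass = collapsed(source, halves, neighbours)
```

Both forms produce the same elements, but in a different order: the collapsed version interleaves the outputs of `fn1` and `fn2` per item. In this model that only matters if a downstream step depends on ordering, which a commutative sum would not.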