Cache Execution evaluations #1057
Conversation
@egonina Want to take a look? I think this fixes the issue you noted today. This is aggressively caching everything, which is probably appropriate for map/reduce. But if someone gets the idea to set up a server running in an infinite recursive flatMap loop, this is going to OOM. I think we can cross that bridge if we ever get a bug report.
 * error to add something to the cache twice clearly).
 */
def getOrElseInsert[T](ex: Execution[T],
    res: => (EvalCache, Future[(T, ExecutionCounters, EvalCache)]))
Can you explain what the res function parameter is doing? Is it the result of the execution, and if so, should it also be a Future?
I am using the by-name parameter (`: =>`) the same way a mutable map does with .getOrElseUpdate: if we have not already evaluated ex, the code to evaluate it goes there. This way the future can actually get into the cache before it is complete, so parallel operations (such as the zip) will not both recompute parents they need. That is why we return both the cache BEFORE the future is complete (BUT already containing this execution) and the cache after it is complete, as the flatMap may cause more Executions to be evaluated along the way.
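A minimal sketch of the by-name idea, outside of scalding (the `FutureCache` class and its key type here are hypothetical, not the PR's actual `EvalCache`): the by-name argument is forced only on a cache miss, and because the stored value is a Future, it can sit in the cache before it completes, so a second lookup shares the in-flight work instead of recomputing it.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ByNameCacheSketch {
  // Hypothetical simplified cache; values are Futures that may still
  // be in flight when they are inserted.
  final class FutureCache[K, V] {
    private val map = mutable.Map.empty[K, Future[V]]
    private var forced = 0

    // `fut` is by-name (: =>), so it is evaluated only on a miss,
    // mirroring mutable.Map.getOrElseUpdate.
    def getOrElseInsert(key: K, fut: => Future[V]): Future[V] =
      map.synchronized {
        map.getOrElseUpdate(key, { forced += 1; fut })
      }

    def forcedCount: Int = map.synchronized(forced)
  }

  def main(args: Array[String]): Unit = {
    val cache = new FutureCache[String, Int]
    val p = Promise[Int]()
    // Insert an *incomplete* Future; a later lookup shares it.
    val f1 = cache.getOrElseInsert("a", p.future)
    val f2 = cache.getOrElseInsert("a", Future(sys.error("never forced")))
    p.success(42)
    assert(f1 eq f2)                         // same shared Future instance
    assert(Await.result(f2, 1.second) == 42) // completes with the first computation
    assert(cache.forcedCount == 1)           // the by-name arg was forced exactly once
    println("ok")
  }
}
```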
If we only had the cache after the future, zips would be made into a sequence: in the case of a diamond (b and c depend on a, and b.zip(c) is what you are evaluating), we would still serialize. Here, a gets into the cache as soon as we start evaluating b, and then c can look at the cache, see a, and wait only for a, not for b.
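The diamond case can be sketched with plain Futures and a concurrent map standing in for the cache (the `cached` helper and key strings are illustrative, not scalding's API): because a's Future is cached as soon as b starts evaluating, c finds it there and shares it, so the common parent runs once.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object DiamondSketch {
  val aRuns = new AtomicInteger(0)
  private val cache = TrieMap.empty[String, Future[Int]]

  // Stand-in for getOrElseInsert: the body is by-name, forced on a miss.
  def cached(key: String)(body: => Future[Int]): Future[Int] =
    cache.getOrElseUpdate(key, body)

  // Diamond: b and c both depend on a.
  def a: Future[Int] = cached("a")(Future { aRuns.incrementAndGet(); 1 })
  def b: Future[Int] = cached("b")(a.map(_ + 10))
  def c: Future[Int] = cached("c")(a.map(_ + 100))

  def main(args: Array[String]): Unit = {
    // Evaluating b inserts a's (possibly incomplete) Future into the cache,
    // so c reuses it rather than recomputing it.
    val zipped = b.zip(c)
    assert(Await.result(zipped, 1.second) == ((11, 101)))
    assert(aRuns.get == 1) // the shared parent was computed only once
    println("ok")
  }
}
```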
Would zips be made into a sequence here because, in the case of b.zip(c), c considers b to be a parent?
b.zip(c) should generally not have b being a parent of c, but it could be. Imagine b.zip(b.map(fn)): we would first schedule the execution of b; then, when scheduling c = b.map(fn), we would find b in the cache and reuse the future there.
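The b.zip(b.map(fn)) case can be sketched the same way (again with an illustrative stand-in cache, not scalding's): a Promise keeps b incomplete while the map side is scheduled, so the second lookup necessarily finds an in-flight Future in the cache and reuses it.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ZipReuseSketch {
  val bRuns = new AtomicInteger(0)
  private val cache = TrieMap.empty[String, Future[Int]]
  private val gate = Promise[Unit]()

  // b cannot complete until the gate opens, so while we schedule the
  // other zip arm, the cache holds an incomplete Future for b.
  def b: Future[Int] =
    cache.getOrElseUpdate("b", gate.future.map { _ => bRuns.incrementAndGet(); 2 })

  def main(args: Array[String]): Unit = {
    // The second reference to b hits the incomplete cached Future.
    val zipped = b.zip(b.map(_ * 10))
    gate.success(()) // now let b finish
    assert(Await.result(zipped, 1.second) == ((2, 20)))
    assert(bRuns.get == 1) // b was evaluated once and shared by both arms
    println("ok")
  }
}
```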
Have you tested this?
I have tested it, in that the current kmeans test runs and exercises it some (and the test runs much faster). But I am working on adding some more comprehensive tests.
closes #1055