Cache Execution evaluations #1057
Conversation
@egonina Want to take a look? I think this fixes the issue you noted today. This is aggressively caching everything, which is probably appropriate for map/reduce. But if someone gets the idea to set up a server running in an infinite recursive flatMap loop, this is going to OOM. I think we can cross that bridge if we ever get a bug report.
 * error to add something to the cache twice clearly).
 */
def getOrElseInsert[T](ex: Execution[T],
    res: => (EvalCache, Future[(T, ExecutionCounters, EvalCache)]))
Can you explain what the res function parameter is doing? Is it the result of the execution, and if so, should it also be a Future?
I am using the by-name parameter (`: =>`) the same way a mutable map does with .getOrElseUpdate: if we have not already evaluated ex, the code to evaluate it goes there. This way the future can actually get into the cache before it is complete, so parallel operations (such as the zip) will not both recompute parents they need. That is why we return both the cache BEFORE the future is complete (BUT already containing this execution) and the cache after it is complete, as the flatMap may cause more Executions to be evaluated along the way.
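A minimal sketch of the by-name idea, outside of scalding (the `FutureCache` class and its key type here are hypothetical, not the PR's actual `EvalCache`): the by-name argument is forced only on a cache miss, and because the stored value is a Future, it can sit in the cache before it completes, so a second lookup shares the in-flight work instead of recomputing it.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ByNameCacheSketch {
  // Hypothetical simplified cache; values are Futures that may still
  // be in flight when they are inserted.
  final class FutureCache[K, V] {
    private val map = mutable.Map.empty[K, Future[V]]
    private var forced = 0

    // `fut` is by-name (: =>), so it is evaluated only on a miss,
    // mirroring mutable.Map.getOrElseUpdate.
    def getOrElseInsert(key: K, fut: => Future[V]): Future[V] =
      map.synchronized {
        map.getOrElseUpdate(key, { forced += 1; fut })
      }

    def forcedCount: Int = map.synchronized(forced)
  }

  def main(args: Array[String]): Unit = {
    val cache = new FutureCache[String, Int]
    val p = Promise[Int]()
    // Insert an *incomplete* Future; a later lookup shares it.
    val f1 = cache.getOrElseInsert("a", p.future)
    val f2 = cache.getOrElseInsert("a", Future(sys.error("never forced")))
    p.success(42)
    assert(f1 eq f2)                         // same shared Future instance
    assert(Await.result(f2, 1.second) == 42) // completes with the first computation
    assert(cache.forcedCount == 1)           // the by-name arg was forced exactly once
    println("ok")
  }
}
```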
If we only had the cache after the future, zips would be made into a sequence: in the case of a diamond (b and c depend on a, and b.zip(c) is what you are evaluating), we would still serialize. Here, a gets into the cache as soon as we start evaluating b, and then c can look at the cache, see a, and wait only for a, not for b.
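The diamond case can be sketched with plain Futures and a concurrent map standing in for the cache (the `cached` helper and key strings are illustrative, not scalding's API): because a's Future is cached as soon as b starts evaluating, c finds it there and shares it, so the common parent runs once.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object DiamondSketch {
  val aRuns = new AtomicInteger(0)
  private val cache = TrieMap.empty[String, Future[Int]]

  // Stand-in for getOrElseInsert: the body is by-name, forced on a miss.
  def cached(key: String)(body: => Future[Int]): Future[Int] =
    cache.getOrElseUpdate(key, body)

  // Diamond: b and c both depend on a.
  def a: Future[Int] = cached("a")(Future { aRuns.incrementAndGet(); 1 })
  def b: Future[Int] = cached("b")(a.map(_ + 10))
  def c: Future[Int] = cached("c")(a.map(_ + 100))

  def main(args: Array[String]): Unit = {
    // Evaluating b inserts a's (possibly incomplete) Future into the cache,
    // so c reuses it rather than recomputing it.
    val zipped = b.zip(c)
    assert(Await.result(zipped, 1.second) == ((11, 101)))
    assert(aRuns.get == 1) // the shared parent was computed only once
    println("ok")
  }
}
```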
Would zips be made into a sequence here because, in the case of b.zip(c), c considers b to be a parent?
b.zip(c) should generally not have b being a parent of c, but it could be. Imagine b.zip(b.map(fn)): we would first schedule the execution of b; then, when scheduling c = b.map(fn), we would find b in the cache and reuse the future there.
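The b.zip(b.map(fn)) case can be sketched the same way (again with an illustrative stand-in cache, not scalding's): a Promise keeps b incomplete while the map side is scheduled, so the second lookup necessarily finds an in-flight Future in the cache and reuses it.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ZipReuseSketch {
  val bRuns = new AtomicInteger(0)
  private val cache = TrieMap.empty[String, Future[Int]]
  private val gate = Promise[Unit]()

  // b cannot complete until the gate opens, so while we schedule the
  // other zip arm, the cache holds an incomplete Future for b.
  def b: Future[Int] =
    cache.getOrElseUpdate("b", gate.future.map { _ => bRuns.incrementAndGet(); 2 })

  def main(args: Array[String]): Unit = {
    // The second reference to b hits the incomplete cached Future.
    val zipped = b.zip(b.map(_ * 10))
    gate.success(()) // now let b finish
    assert(Await.result(zipped, 1.second) == ((2, 20)))
    assert(bRuns.get == 1) // b was evaluated once and shared by both arms
    println("ok")
  }
}
```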
Have you tested this?
I have tested it, in that the current kmeans test runs and exercises it some (and the test runs much faster). But I am working on adding some more comprehensive tests.
closes #1055