
Add generic TypedPipe optimization rules #1724

Merged · 11 commits into develop · Oct 3, 2017

Conversation

johnynek (Collaborator):

An example of optimization rules and how they can simplify and factor our implementation.

This also helps motivate #1718

cc @non @piyushnarang @ianoc


/**
* In map-reduce settings, Merge is almost free in two contexts:
* 1. the final write

Collaborator:

I'm not sure what the gain of this rule is in an MR context. I think deferring the merge is only useful for filters, maybe? Since if you have two steps

M -> R1
M -> R2

and do some (R1 ++ R2).filter(...), we know it should be good. But pushing flatMaps and maps in there isn't as clear a win to me.
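
To make sure I'm reading it right, the rewrite I'm picturing is roughly the sketch below (illustrative names only, not the exact rule in this patch):

import com.twitter.scalding.typed.TypedPipe

// A minimal sketch of deferring a merge past a filter, assuming pipes r1 and r2
// and a predicate fn: the filter is pushed into each branch so the merge moves
// closer to the sink.
def deferMergePastFilter[A](r1: TypedPipe[A], r2: TypedPipe[A], fn: A => Boolean): TypedPipe[A] =
  // (r1 ++ r2).filter(fn) is rewritten to:
  r1.filter(fn) ++ r2.filter(fn)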

johnynek (Author):

As I mentioned in the comment, if you can defer the merge until a groupBy, scalding can skip doing any work. In Tez, if not, that is actually an extra materialization. On MR, cascading applies this same optimization under the hood (but, for some reason, not with Tez).

Also, clearly, for writes it is a win.

Note, each tuple is only on one side or the other, so by pushing the merge lower (closer to the sinks) in the graph, we don't do more work. So it is hard to see how it could hurt, either.

Collaborator:

Which side do the functions go to in a groupBy? This might change how things get planned on MR negatively.

I.e.

M -> R1
M -> R2
(R1 ++ R2).flatMap { huge data cubing function}.sumByKey

If the flatMap is pushed through the merge, it's more likely it'll end up on the reducers and you'll write a lot more data. This isn't entirely under the control of scalding in general, of course, but this sort of rule could change how the DAG is exposed to cascading and cause odd regressions?

Ah, so the notion here is that if you merge two sources, say from disk, then do some map operations and a groupBy, under Tez cascading will introduce an extra step to do the merge before the map operations?

For writes, is it a win under normal MR?

Given that only one set of mappers or reducers will write to any sink, won't the effects be sort of random depending on data sizes?

M -> R(f1,f2) ------> MergeMapperToSink
M -> R2(f1,f2) ---/

M -> R ------> MergeMapperToSink(f1,f2)
M -> R2 ---/

Which one of these is better largely depends on the shape of f1 and f2, no? (I.e., input and output data types, how well they serialize/compress, whether they expand the data, or similar.)

An example would be something like writing ngrams from multiple sources; in this case you'd want the merge to be late, since you possibly write more tuples than you had in either of the previous steps. (Not so common, granted, but I did it recently so it comes to mind.)
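
As a rough illustration of that ngram case (hypothetical code, just to show where the expansion happens):

// Each line expands into many n-grams, so wherever this flatMap runs is where
// the data blows up; pushing it into both branches before the merge means both
// of the earlier steps write the expanded output.
def ngrams(line: String, n: Int): Iterator[String] =
  line.split("\\s+").sliding(n).map(_.mkString(" "))

// (src1 ++ src2).flatMap(ngrams(_, 3))                     // expansion after the merge
// src1.flatMap(ngrams(_, 3)) ++ src2.flatMap(ngrams(_, 3)) // expansion in each branch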

Collaborator:

@ianoc - yeah I think I ran into the merge + map + groupBy issue in cascading 3 - cwensel/cascading#59. From what I remember there was an extra merge node.


/**
* This is an optimization we didn't do in scalding 0.17 and earlier
* because .toTypedPipe on the group totally hid the structure from

Collaborator:

Very nice! This is a great optimization to get in.

ianoc (Collaborator) commented Sep 25, 2017:

I presume our existing test coverage validates that these are fine changes to make from a correctness POV. Do you plan to merge this? If so, any thoughts on whether we need tests to show that the rules do what they say they do?

johnynek (Collaborator, Author):

@ianoc I think we should have tests for any rule.

To do this, we probably want to back all rules out of the default cascading planner so it just does a literal translation of scalding -> cascading. Then the test is: before and after applying the rule, the result is the same after we plan it to cascading and run it.

So, this is just here to show a set of rules to motivate merging #1718, which I think is the high-priority thing.
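
Roughly, the test shape I have in mind (a sketch only; optimize and runToList here are stand-ins for however we end up applying a single rule and planning-and-running a pipe, not existing APIs):

import com.twitter.scalding.typed.TypedPipe

// Sketch: a rule must not change the results, only the graph.
def checkRule[A: Ordering](
  pipe: TypedPipe[A],
  optimize: TypedPipe[A] => TypedPipe[A], // applies the rule under test
  runToList: TypedPipe[A] => List[A]      // literal plan to cascading, then run
): Unit = {
  val before = runToList(pipe).sorted
  val after = runToList(optimize(pipe)).sorted
  assert(before == after, "optimization rule changed the results")
}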

johnynek (Collaborator, Author) commented Sep 26, 2017 via email.

piyushnarang (Collaborator) left a comment:

I haven't checked out these changes and run them, so maybe this is a trivial question: do we log the various optimization rules that get applied? I'm guessing it might be useful for users to see what set of optimization rules got applied to their Scalding job / execution graph. (We could follow this up in a future PR, as we don't have it as of today.)

@@ -334,11 +334,218 @@ object OptimizationRules {
Binary(recurse(hj.left), rightLit,
{ (ltp: TypedPipe[(K, V)], rtp: TypedPipe[(K, V2)]) =>
rtp match {
-              case ReduceStepPipe(hg: HashJoinable[K, V2]) =>
+              case ReduceStepPipe(hg: HashJoinable[K @unchecked, V2 @unchecked]) =>

Collaborator:

Is the @unchecked something we can move to the start of the case?

//
/////////////////////////////

object ComposeFlatMap extends PartialRule[TypedPipe] {

Collaborator:

Wondering if it would be a good idea to include a short snippet of code as a comment at the top of each rule, showing which scenario it gets applied in?
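
For ComposeFlatMap, for example, I'd imagine something like the snippet below (my guess at the scenario, assuming the rule fuses consecutive flatMaps):

import com.twitter.scalding.typed.TypedPipe

// Before: two flatMap nodes in the graph.
def before[A, B, C](a: TypedPipe[A], f: A => Iterable[B], g: B => Iterable[C]): TypedPipe[C] =
  a.flatMap(f).flatMap(g)

// After: one flatMap node with the functions composed.
def after[A, B, C](a: TypedPipe[A], f: A => Iterable[B], g: B => Iterable[C]): TypedPipe[C] =
  a.flatMap { x => f(x).flatMap(g) }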

* This is arguably not a great idea, but scalding has always
* done it to minimize accidental map-reduce steps
*/
object IgnoreNoOpGroup extends PartialRule[TypedPipe] {

Collaborator:

This is something that has tripped up a lot of first-time users of scalding. Should we add a warning log line when this rule is being applied, to say that we're ignoring the group?
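
For reference, the case I mean is something like this (a sketch, assuming I'm reading the rule right):

import com.twitter.scalding.typed.TypedPipe

// The group is never followed by any reduce operation, so it is treated as a
// no-op and planned as if it were just `pipe`: no shuffle happens.
def noOpGroup[K: Ordering, V](pipe: TypedPipe[(K, V)]): TypedPipe[(K, V)] =
  pipe.group.toTypedPipe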

johnynek (Author):

Adding a warning log line is not a bad idea in general. I'm not sure of the right place to do that. I kind of hate to do it in the optimizer, since you might not even be running the result. Let me think about it.



johnynek changed the title from "[WIP] Add generic TypedPipe optimization rules" to "Add generic TypedPipe optimization rules" on Oct 1, 2017.
johnynek (Collaborator, Author) commented Oct 1, 2017:

@ianoc can you take another look?

I have added tests of correctness, which I think are reasonable tests. They don't actually test that something about the graph got better. I want to do that too eventually, but maybe not in the first version. Something like: either the number of map-reduce steps is <= the original, or the total DAG depth is <= the original depth. Cascading doesn't make it trivial to do that kind of test, I don't think (I have to print the DAG to a dot file and then parse it, I think).
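
Eventually something with this shape (a sketch only; stepCount is a stand-in for however we would count planned map-reduce steps from the cascading flow):

import com.twitter.scalding.typed.TypedPipe

// Sketch: the optimized graph should never plan to more steps than the original.
def neverWorse[A](
  pipe: TypedPipe[A],
  optimize: TypedPipe[A] => TypedPipe[A],
  stepCount: TypedPipe[A] => Int // count of planned map-reduce steps
): Unit =
  assert(stepCount(optimize(pipe)) <= stepCount(pipe), "optimization added map-reduce steps")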

johnynek changed the base branch from oscar/use-dagon to develop on October 1, 2017, 19:16.
type JoinFn[K, V, U, R] = (K, Iterator[V], Iterable[U]) => Iterator[R]
type HashJoinFn[K, V, U, R] = (K, V, Iterable[U]) => Iterator[R]

def toCogroupJoiner2[K, V, U, R](hashJoiner: (K, V, Iterable[U]) => Iterator[R]): JoinFn[K, V, U, R] =

Collaborator:

Can you add a comment on this method? (Motivation as much as anything -- not sure why we would normally want to turn a hash join into a JoinFn.)

Also, I think we can use the HashJoinFn[K, V, U, R] type alias as the type of the arg to this method.
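
(For my own understanding, I'd guess the shape is roughly the following, reusing the aliases above -- just a guess, not necessarily the actual implementation:)

object JoinSketch {
  type JoinFn[K, V, U, R] = (K, Iterator[V], Iterable[U]) => Iterator[R]
  type HashJoinFn[K, V, U, R] = (K, V, Iterable[U]) => Iterator[R]

  // Lift a per-value hash-join function into a cogroup-style join by iterating
  // the streamed (left) side.
  def toCogroupJoiner2[K, V, U, R](hashJoiner: HashJoinFn[K, V, U, R]): JoinFn[K, V, U, R] =
    { (k, vs, us) => vs.flatMap { v => hashJoiner(k, v, us) } }
}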

johnynek (Author):

Will do these in the follow-up, if that's okay.

We never use this method, in fact, but I didn't want to remove it in the interest of being nice to users that may be using it.

Collaborator:

Yep, totes. And it makes sense to leave or kill it, for the obvious reasons to do either.

}

/**
* a.onComplete(f).onComplete(g) == a.onComplete { () => f(); g() }

Collaborator:

When/where do we use an onComplete? (I.e., given it's a side effect, do we need to ensure all of these are run / do try/error handling around each invocation?)

johnynek (Author):

I don't think we ever do. Users at Twitter used this to make some call to an API after they finished doing some side-effecting map or reduce. Other users have asked for it. I kind of hate the API now, since I think it is an anti-pattern, but alas, I don't want to remove it (and we can optimize it, since we can move it to the very end of a map phase, commuting with other map operations).
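
The kind of move I mean, roughly (a sketch of what the optimizer would be allowed to do, not a rule included in this PR):

import com.twitter.scalding.typed.TypedPipe

// The side effect only needs to run once the map phase finishes, so it can
// slide past other map-side operations toward the end of the phase:
def pushOnCompleteLater[A, B](p: TypedPipe[A], f: () => Unit, g: A => B): TypedPipe[B] =
  // p.onComplete(f).map(g) can be rewritten as:
  p.map(g).onComplete(f)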

travisbrown (Contributor):

@johnynek I'm not terribly familiar with the Cascading APIs but wouldn't Flow#getFlowStats give you the info you need for that test without having to parse the DOT file?

ianoc (Collaborator) commented Oct 3, 2017:

This LGTM.

We don't have code coverage for this project, right? It might be nice to know how many of the rules our generated DAG gives us coverage for. (Not a merge blocker.)

johnynek merged commit 1a52975 into develop on Oct 3, 2017.