Add `withDescription` for naming MR steps #1283
Conversation
This is pretty great. Thanks for doing this. I'll review it carefully this week.
will change. Any other issues? @johnynek?
@@ -31,6 +36,25 @@ trait ExecutionContext {

```scala
  def flowDef: FlowDef
  def mode: Mode

  def getIdentifierOpt(descriptions: Seq[String]): Option[String] = {
    if (descriptions.nonEmpty) {
      Some(descriptions.distinct.mkString(", "))
```
can we sort these to make sure it is stable? Not really sure we should sort on second thought. It would be good if the order is roughly the order that the descriptions came in.
let's make this a private method on object ExecutionContext.
Agreed to make it private. I wanted the ordering to stay as the descriptions came in, and did the distinct just to be safe.
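For illustration, the behavior being discussed can be sketched as a standalone function (a hypothetical copy of the method above, not the shipped code): `distinct` keeps the first occurrence of each description, so the original ordering survives deduplication without any sorting.

```scala
// Hypothetical standalone sketch of getIdentifierOpt:
// distinct preserves first-occurrence order, so descriptions stay roughly
// in the order they were added while duplicates are dropped.
def getIdentifierOpt(descriptions: Seq[String]): Option[String] =
  if (descriptions.nonEmpty) Some(descriptions.distinct.mkString(", "))
  else None
```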
Have you tried this on your cluster at all? It's hard to be sure we are hitting all the cases (hashJoin, joins, groupBys, forceToDisk are all different code paths). Could we beef up the tests at the very least to try each of those cases?
Yes, tried it on a bunch of our jobs which used all those cases (except for …)
Making a macro to get the current line and file would be useful with this:
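As a rough sketch of that idea without macros (purely hypothetical; a real version would use a macro, or a compile-time library such as sourcecode, so the file and line are captured at the call site with no runtime cost), one can read the caller's position from the stack:

```scala
// Hypothetical runtime stand-in for the macro idea: derive a "File:Line"
// description from the caller's stack frame. A macro would capture this
// at compile time; this is only an illustration of the desired output.
def callSiteDescription(): String = {
  val frame = new Exception().getStackTrace()(1) // index 1 = the caller
  s"${frame.getFileName}:${frame.getLineNumber}"
}
```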
```scala
  }

  def setPipeDescriptions(p: Pipe, descriptions: Seq[String]): Pipe = {
    p.getStepConfigDef().setProperty(
```
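The config-property mechanism can be sketched in isolation (a hypothetical illustration: the `ConfigDef` stand-in and the key name `"scalding.step.descriptions"` are assumptions, not the real Cascading API). New descriptions are merged into a comma-separated property and deduplicated:

```scala
import scala.collection.mutable

// Minimal stand-in for Cascading's step ConfigDef: a mutable key-value map.
class ConfigDef(val props: mutable.Map[String, String] = mutable.Map.empty) {
  def setProperty(k: String, v: String): Unit = props(k) = v
  def getProperty(k: String): Option[String] = props.get(k)
}

// Append descriptions into a comma-separated property, dropping duplicates.
// (Key name "scalding.step.descriptions" is hypothetical.)
def appendDescriptions(conf: ConfigDef, descriptions: Seq[String]): Unit = {
  val existing = conf.getProperty("scalding.step.descriptions").toSeq
  val merged = (existing ++ descriptions)
    .flatMap(_.split(","))
    .map(_.trim)
    .filter(_.nonEmpty)
    .distinct
  conf.setProperty("scalding.step.descriptions", merged.mkString(", "))
}
```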
This will mostly work, but... our OrderedSerialization issues have shown that these won't always stay together :/ We tried to use this to configure serializers for each step: do an op that does some transform, then add the config to that pipe to make it work. But then the two would be moved apart by cascading. Not a blocker for this, but it unfortunately means the naming ends up being more best-effort than would be ideal, I think.
I see your point, and I wonder: if we removed `withDescription` from `TypedPipe` and only allowed it after you `.group`, would that better guarantee it is attached to the step where it occurs? But then it is also less powerful, and it is nice to annotate descriptions everywhere.
Funnily enough, after a groupBy is exactly where we've been seeing it fail :/ I'm not sure it's worth altering the plan here. This is a big step up in info/usability. We might just want to include a little note about YMMV in a README, I guess.
Looks like this has legitimate failures against latest develop now; it doesn't compile.
@@ -1036,6 +1040,19 @@ class WithOnComplete[T](typedPipe: TypedPipe[T], fn: () => Unit) extends TypedPipe[T] {

```scala
    forceToDiskExecution.flatMap(_.toIterableExecution)
  }

class WithDescriptionTypedPipe[T](typedPipe: TypedPipe[T], description: String) extends TypedPipe[T] {
  override def toPipe[U >: T](fieldNames: Fields)(implicit flowDef: FlowDef, mode: Mode, setter: TupleSetter[U]) = {
```
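The wrapper pattern behind `WithDescriptionTypedPipe` can be sketched with simplified stand-in types (the `Node`/`Leaf` names are hypothetical; the real class delegates all `TypedPipe` operations to the wrapped pipe): each wrapper node carries one description, and walking the chain collects them in the order they were added.

```scala
// Simplified sketch of the wrapper idea: a node that carries a description
// and otherwise delegates to the wrapped value. Not the real TypedPipe.
trait Node[T] { def descriptions: Seq[String] }

case class Leaf[T](value: T) extends Node[T] {
  def descriptions = Nil
}

case class WithDescription[T](inner: Node[T], d: String) extends Node[T] {
  // Collect descriptions outside-in so they read in the order they were added.
  def descriptions = inner.descriptions :+ d
}
```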
this should be asPipe now.
@aposwolsky we are aiming to freeze for a release on Wednesday. It would be awesome to get this in for that if we can.
@aposwolsky we are looking to freeze for our next release at 5pm PST today; are you going to have a chance to look at this before then?
I was OOO the past few days, but I rebased and pushed the change from `toPipe` to `asPipe`.
```scala
val conf = step.getConfig
getIdentifierOpt(getDesc(step)).foreach(id => {
  val newJobName = "%s %s".format(conf.getJobName, id)
  println("Added descriptions to job name: %s".format(newJobName))
```
can we get a log.info or debug here instead of a println?
also, this looks like it is changing how all jobs are named. That is somewhat orthogonal to descriptions, right?
We have a lot of tooling that assumes the job names are not changed. Could we instead just add a key to the jobConf?
Hmmm, we actually like it showing in the Hadoop JobTracker. What about making it configurable, so we keep it setting the name on our side while the default does something else?
Also, the "job name" always had "(X/Y)" appended to it; all this is doing is appending extra stuff after that, not changing the prefix.
We have some analytics that run on our jobs and they match on the job name. For instance, we have a reducer estimator that runs and looks at the names to get the history of the particular step. Others might have other systems running.
I think changing how the jobs are named could be disruptive for people.
Can we find a way to opt-in to this behavior and you can put those options as default in your clusters? Something like a configuration key like "scalding.description.addtoname" -> "true"
or something? When this is not present, we just add an entry to the jobConf?
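The opt-in behavior suggested here might look roughly like this (a sketch only: the key name is the reviewer's proposal, and `maybeRename` is a hypothetical helper, not code from the PR). The job name is only changed when the flag is explicitly set:

```scala
// Hypothetical helper sketching the reviewer's opt-in proposal: only append
// descriptions to the job name when "scalding.description.addtoname" is set.
def maybeRename(jobName: String,
                descriptions: Seq[String],
                conf: Map[String, String]): String = {
  val optIn = conf.get("scalding.description.addtoname").contains("true")
  if (optIn && descriptions.nonEmpty)
    s"$jobName ${descriptions.mkString(", ")}"
  else
    jobName // default: leave the name alone so existing tooling keeps working
}
```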
I already pushed a change that puts the descriptions in a different config and does not update mapred.name. Since we want it appended to the name, I will add a step listener on our side that adds the value of that config to the job name... a win-win for everyone :)
Made suggested changes
How do we re-run the tests? It seems to have failed on spurious errors.
looking
```scala
flowSteps.foreach(step => {
  val baseFlowStep: BaseFlowStep[JobConf] = step.asInstanceOf[BaseFlowStep[JobConf]]
  val descriptions = getDesc(baseFlowStep)
  updateFlowStepName(baseFlowStep, descriptions)
```
Looks like you are still updating the job name. Like I said, I don't want to change that without more thought and buy-in. Our existing job monitoring will be impacted, and probably other systems will be as well. I know we should have a test for anything we care about; we can add an issue for that here (if indeed we want some contract on the name, rather than another way to track job identity).
The mapred.name is already set by the time it gets here, so the job.xml file for Hadoop will not have any mention of the descriptions. However, the stepName needs to be updated for the .dot file generation to include the descriptions.
@johnynek: ok, I changed it so it only changes the flow stepName in the .dot file generation; otherwise it just sets a different config. This works for our setup since (1) .dot files have descriptions and (2) I added a FlowStepStrategy that adds the descriptions to the job name, so it looks good on our side. Does this solution work for you as well? I would like to see this pull request go through because I don't want to have to start maintaining a patch.
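The step-listener idea can be sketched with stand-in types (the `Step` class and config key here are hypothetical; the real hook would be Cascading's `FlowStepStrategy`): at submit time, read the descriptions config on each step and append it to the step name.

```scala
// Simplified sketch of the flow-step-strategy idea from the discussion.
// Step is a stand-in for a Cascading flow step; only the name and config
// matter for this illustration.
case class Step(var name: String, config: Map[String, String])

// Append each step's stored descriptions (if any) to its display name.
def applyDescriptions(steps: Seq[Step]): Unit =
  steps.foreach { s =>
    s.config.get("scalding.step.descriptions").foreach { d =>
      s.name = s"${s.name}: $d"
    }
  }
```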
ping? |
Looking now. Sorry. |
```scala
  if (descriptions.nonEmpty) Some(descriptions.distinct.mkString(", ")) else None
}

private def updateStepConfigWithDescriptions(step: BaseFlowStep[JobConf], descriptions: Seq[String]): Unit = {
```
`descriptions` is unused, right?
Good catch! Thanks, will fix.
One minor issue, then it looks good to merge (remove an unused parameter). Thanks for pushing this through to the end.
no problem! made changes
Thanks! 👍 Will merge when it's green.
:( can we re-run the CI build?
restarted
success!
Add `withDescription` for naming MR steps
Thanks for this nice new feature!
anytime :)
@aposwolsky, thanks for adding this feature. It is really great!
Adds the ability to have more meaningful names in MR steps instead of just "(X/Y)".