Basic reducer estimator support #973

bholt · 2014-07-25T01:39:56Z

Add support for reducer estimators. Plugs into Cascading's FlowStepStrategy to run the estimator before the invocation of each job step.

The reducer estimator to run is selected by a configuration parameter that specifies its class name. A second flag controls whether the estimate overrides cases where the user has explicitly set the number of reducers (Grouped.withReducers()).
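To make the override flag concrete, here is a minimal sketch of the decision it implies. It is not the code in this PR: `ReducerOverrideSketch`, `resolveReducers`, and the parameter names are hypothetical, and the sketch assumes an explicit setting is kept whenever the flag is off or no estimate is available.

```scala
// Hypothetical sketch; none of these names are Scalding's API.
object ReducerOverrideSketch {
  /**
   * explicit:         reducers set by the user, e.g. via Grouped.withReducers()
   * estimated:        reducers suggested by the configured estimator, if any
   * overrideExplicit: value of the override flag
   */
  def resolveReducers(
    explicit: Option[Int],
    estimated: Option[Int],
    overrideExplicit: Boolean): Option[Int] =
    (explicit, estimated) match {
      // The estimate replaces an explicit setting only when the flag is on.
      case (Some(_), Some(e)) if overrideExplicit => Some(e)
      // Otherwise an explicit setting is left alone (assumption).
      case (Some(n), _) => Some(n)
      // Nothing explicit: use the estimate if there is one.
      case (None, e) => e
    }
}
```

Under these assumptions, `resolveReducers(Some(10), Some(3), overrideExplicit = false)` keeps the user's 10 reducers, while the same call with the flag on yields 3.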

Currently only one estimator has been implemented: InputSizeReducerEstimator, which will look at the amount of input data (at each step) and choose a number of reducers based on the specified "bytes.per.reducer" parameter.

hadoop jar myjob.jar \
  -Dscalding.reducer.estimator=com.twitter.scalding.estimator.InputSizeReducerEstimator \
  -Dscalding.reducer.estimator.override=true \
  -Dscalding.reducer.estimator.bytes.per.reducer=1024 \
  com.twitter.scalding.MyJob --hdfs
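The arithmetic behind an input-size estimate can be sketched as follows. This is an illustration, not the InputSizeReducerEstimator from this PR: `InputSizeSketch` and `estimateReducers` are hypothetical names, and rounding up and using at least one reducer are assumptions of the sketch.

```scala
// Hypothetical sketch of an input-size based estimate.
object InputSizeSketch {
  /**
   * inputSizes:      size in bytes of each input to the step
   * bytesPerReducer: target amount of input handled by one reducer
   */
  def estimateReducers(inputSizes: Seq[Long], bytesPerReducer: Long): Int = {
    require(bytesPerReducer > 0, "bytesPerReducer must be positive")
    val totalBytes = inputSizes.sum
    // Ceiling division, so a partial chunk still gets its own reducer (assumption).
    val reducers = (totalBytes + bytesPerReducer - 1) / bytesPerReducer
    // Clamp to at least 1 reducer and to the Int range.
    math.min(Int.MaxValue.toLong, math.max(1L, reducers)).toInt
  }
}
```

With the 1024 bytes per reducer used in the command above, a step reading 5000 bytes would get 5 reducers under this sketch, and an empty input would still get 1.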

This PR differs from the version currently working its way into Twitter's internal codebase by integrating into the ExecutionContext directly, so reducer estimation is available from old-style Jobs, scalding-as-a-library uses, and the REPL.


johnynek · 2014-07-25T02:06:56Z

scalding-core/src/main/scala/com/twitter/scalding/Job.scala

+  /**
+   * Specify a callback to run before the start of each flow step.
+   *
+   * Defaults to what Config.getReducerEstimator specifies.


default is a weird term because in fact, both can be applied.

Well, if you set a stepStrategy here it would override the reducer estimator provided by ExecutionContext.buildFlow. So I don't think you can have both. Perhaps default isn't the right way to describe it regardless.

I thought cascading allows you to have more than one. I didn't check carefully though.


bholt · 2014-07-25T21:35:48Z

Added a way to test the job configuration after the fact. This required saving the Flow from Job.run. Open to suggestions for a better place to make this change. We could restrict this change to MiniMRCluster tests by subclassing Job and enforcing that HadoopPlatformTestJob takes this subclass instead of Job.

Also something to consider is if it's worth making the "inspectCompletedFlow" more special-purpose. We could remove some boilerplate, even in these two tests, by making a more specific "inspectFlowSteps" or something like that.

jcoveney · 2014-07-25T22:51:05Z

More special-purpose how? What do you have in mind? Part of the motivation of the LocalCluster is to be able to inspect things like this, so this is actually quite useful. In 0.9.0 we had a regression where .groupAll.sum wasn't being optimized, for example, and I think this would let us do that. But perhaps a helper method or two is a good idea!

bholt · 2014-07-25T23:18:12Z

Yeah, all I can think of is something that does the "flow.getFlowSteps.asScala" for you, but I'm thinking it's not really worth making such a simple helper. Having the whole flow at your disposal does open up a bunch of things you can check.


johnynek · 2014-07-28T19:49:50Z

scalding-core/src/main/scala/com/twitter/scalding/Config.scala

@@ -35,6 +38,9 @@ import scala.util.{ Failure, Success, Try }
 * This is a wrapper class on top of Map[String, String]
 */
 trait Config {
+
+  private val LOG = LoggerFactory.getLogger(this.getClass)


Is this being used? I am nervous about it causing serialization errors. All kinds of things in scalding are serialized, and loggers have given issues before.

my bad; those were in there to report the exceptions (ClassNotFoundException, InstantiationException) that we decided to just throw now.
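The serialization concern can be illustrated with plain Java serialization. This is only a sketch: `FakeLogger` is a hypothetical stand-in for a logger that is not `java.io.Serializable`, the holder classes are invented for the example, and Scalding's own serialization paths may behave differently. It shows why an eagerly initialized logger field can break serialization and how a `@transient lazy val` avoids that, had the logger been kept.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical sketch; these classes are not part of Scalding.
object LoggerSerializationSketch {
  // Stand-in for a logger that does not implement java.io.Serializable.
  class FakeLogger { def info(msg: String): Unit = println(msg) }

  // Eager field: the logger becomes part of the object's serialized state.
  class EagerHolder extends Serializable {
    private val log = new FakeLogger
    def touch(): Unit = log.info("eager")
  }

  // Transient lazy field: skipped when serializing, rebuilt on first use.
  class LazyHolder extends Serializable {
    @transient private lazy val log = new FakeLogger
    def touch(): Unit = log.info("lazy")
  }

  // Returns true if Java serialization accepts the object.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

In this sketch `canSerialize(new EagerHolder)` fails because the logger field is written with the object, while `canSerialize(new LazyHolder)` succeeds.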


johnynek · 2014-07-28T19:57:39Z

scalding-core/src/main/scala/com/twitter/scalding/estimator/Common.scala

@@ -0,0 +1,75 @@
+package com.twitter.scalding.estimator


maybe reducerestimator or runtimeconfig? We need something more descriptive than estimator, or more general (something to encompass flow plan changes?)

Absolutely. Which would you prefer? I guess the question is, what else can you imagine putting in this package?

also, since package names are generally lowercase, do we need an "_" for separating words? Can we come up with a single-word name?

I guess reducer_estimation is good. Naming is hard.

Works for me. Long names with tons of subdirectories is the name of the JVM game.


johnynek · 2014-07-31T00:20:03Z

merge when green.

reconditesea · 2014-07-31T00:26:18Z

So great to have this feature in Scalding!


sriramkrishnan · 2014-07-31T17:35:56Z

Good stuff @bholt!

Brandon Holt added 11 commits July 15, 2014 15:30

dummy estimator plugging into FlowStepStrategy

b035ed6

runnable MiniMRCluster test

d5c970d

implement basic 'inputSize/bytesPerReducer' estimator

cee4953

test multi-step job in MiniMRCluster

53ca0af

minor changes/cleanup

8997346

make ReducerEstimator configurable by config variable

923056c

allow passing config to MiniMRCluster jobs, add needed Log4j class to…

58a320a

… path

rename/refactor/move

01b4989

factor out common estimator code, handle override/explicit, save thin…

9ca70f9

…gs in jobconf

move reducerEstimator creation into ExecutionContext via Config, upda…

7773f45

…te with things from Twitter internal version, pass Config to LocalCluster.initialize

de-dup a couple config params

4bb9d5a

johnynek reviewed Jul 25, 2014

Brandon Holt added 3 commits July 25, 2014 11:31

format class list

b1f705f

add 'completedFlow' output on Job, use for 'inspectCompletedFlow' in …

882910d

…HadoopPlatformTestJob

move 'HadoopNumReducers' out into Config

e91fe22

Brandon Holt added 2 commits July 25, 2014 18:05

changes from internal reviewboard

c59511c

- prettier InputSizeReducerEstimator w/ foldLeft/match - change some config parameter names - throw ClassNotFoundException/InstantiationException

Merge branch 'develop' into bholt/estimator (had to move "buildFlow" …

c393fb2

…addition) Conflicts: scalding-core/src/main/scala/com/twitter/scalding/Execution.scala

johnynek reviewed Jul 28, 2014

remove unused loggers, use 'Option.collect' to get rid of compilation…

8400351

… warning

johnynek reviewed Jul 28, 2014

Brandon Holt added 2 commits July 28, 2014 13:02

add comment about 'completedFlow' var

be8ffac

refactor package name estimator -> reducer_estimation

b793ad7

return type of 'getReducerEstimator' should be more specific than 'Fl…

9c0fa8c

…owStepStrategy'

bholt added a commit that referenced this pull request Jul 31, 2014

Merge pull request #973 from twitter/bholt/estimator

76b4030

Basic reducer estimator support

bholt merged commit 76b4030 into develop Jul 31, 2014

bholt deleted the bholt/estimator branch July 31, 2014 17:11
