-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Basic reducer estimator support #973
Conversation
…te with things from Twitter internal version, pass Config to LocalCluster.initialize
/** | ||
* Specify a callback to run before the start of each flow step. | ||
* | ||
* Defaults to what Config.getReducerEstimator specifies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default is a weird term because in fact, both can be applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if you set a stepStrategy here it would override the reducer estimator provided by ExecutionContext.buildFlow. So I don't think you can have both. Perhaps default isn't the right way to describe it regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought cascading allows you to have more than one. I didn't check carefully though.
Added a way to test the job configuration after the fact. This required saving the Flow from Job.run. Open to suggestions for a better place to make this change. We could restrict this change to MiniMRCluster tests by subclassing Job and enforcing that HadoopPlatformTestJob takes this subclass instead of Job. Also something to consider is if it's worth making the "inspectCompletedFlow" more special-purpose. We could remove some boilerplate, even in these two tests, by making a more specific "inspectFlowSteps" or something like that. |
More special-purpose how? What do you have in mind? Part of the motivation of the LocalCluster is to be able to inspect things like this, so this is actually quite useful. In 0.9.0 we had a regression where .groupAll.sum wasn't being optimized, for example, and I think this would let us do that. But perhaps a helper method or two is a good idea! |
Yeah, all I can think of is something that does the "flow.getFlowSteps.asScala" for you, but thinking it's not really worth making such a simple helper. Having the whole flow at your disposal does open a bunch of things you can check. |
- prettier InputSizeReducerEstimator w/ foldLeft/match - change some config parameter names - throw ClassNotFoundException/InstantiationException
…addition) Conflicts: scalding-core/src/main/scala/com/twitter/scalding/Execution.scala
@@ -35,6 +38,9 @@ import scala.util.{ Failure, Success, Try } | |||
* This is a wrapper class on top of Map[String, String] | |||
*/ | |||
trait Config { | |||
|
|||
private val LOG = LoggerFactory.getLogger(this.getClass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being used? I am nervous about it causing serialization errors. All kinds of things in scalding are serialized, and loggers have given issues before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad; those were in there to report the exceptions (ClassNotFoundException, InstantiationException) that we decided to just throw now.
@@ -0,0 +1,75 @@ | |||
package com.twitter.scalding.estimator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe reducerestimator or runtimeconfig? We need something more descriptive than estimator, or more general (something to encompass flow plan changes?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Which would you prefer? I guess the question is, what else can you imagine putting in this package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, since package names are generally lowercase, do we need an "_" for separating words? Can we come up with a single-word name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess reducer_estimation is good. Naming is hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. Long names with tons of subdirectories is the name of the JVM game.
merge when green. |
So great to have this feature in Scalding! On Wed, Jul 30, 2014 at 5:20 PM, P. Oscar Boykin notifications@github.com
Kevin Lin | Twitter, Inc. Follow me: @reconditesea https://twitter.com/reconditesea |
Good stuff @bholt! |
Add support for reducer estimators. Plugs into Cascading's FlowStepStrategy to run the estimator before the invocation of each job step.
Which reducer estimator is run is controlled by configuration parameters specifying the classname. Another flag controls whether the estimate should override cases where user has explicitly set reducers (
Grouped.withReducers()
).Currently only one estimator has been implemented:
InputSizeReducerEstimator
, which will look at the amount of input data (at each step) and choose a number of reducers based on the specified "bytes.per.reducer" parameter.This PR differs from what is currently working its way into internal Twitter by integrating into the ExecutionContext directly, so reducer estimation can be available from old-style Jobs, scalding-as-a-library uses, and in the REPL.