Fix Rounding Bug in RatioBasedEstimator #1542

Merged: 8 commits into develop, Apr 4, 2016

Conversation

isnotinvain (Contributor)

Fixes the rounding bug mentioned in #1541, and adds a maximum for reducer estimation as well.

*/
override def estimateReducers(info: FlowStrategyInfo): Option[Int] =
def estimateExactReducers(info: FlowStrategyInfo): Option[Double] = {

isnotinvain (Contributor Author):
probably a bad name
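
For context, a minimal sketch of the shape this split takes, not the actual patch: in the real code both methods take a FlowStrategyInfo and pull sizes from the flow's history, so the byte counts below are stand-in parameters.

```scala
// Illustrative sketch only: keep the estimate as a Double, and only when
// producing the final reducer count round it up (never truncate to zero).
object RoundingSketch {
  def estimateExactReducers(reducerInputBytes: Double, bytesPerReducer: Double): Option[Double] =
    if (reducerInputBytes <= 0 || bytesPerReducer <= 0) None
    else Some(reducerInputBytes / bytesPerReducer)

  def estimateReducers(reducerInputBytes: Double, bytesPerReducer: Double): Option[Int] =
    estimateExactReducers(reducerInputBytes, bytesPerReducer)
      .map(exact => math.ceil(exact).toInt.max(1))
}
```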

@johnynek (Collaborator)

Can you have a test that shows this issue and that this fixes it?

@isnotinvain (Contributor Author)

@johnynek yes:

Needs tests, and some discussion about a hard cap as well

@isnotinvain (Contributor Author)

Running the new test on the develop branch yields:

[info] - should handle mapper output explosion over small data correctly *** FAILED ***
[info]   1000 did not equal 2 (RatioBasedEstimatorTest.scala:216)
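
With illustrative numbers (not the actual test fixtures), the intended behavior is that a small input whose mappers emit much more data still gets a small, non-zero reducer count once the ratio is kept as a Double and only the final count is rounded up; pasted into a Scala REPL:

```scala
// Illustrative numbers only, not the test's fixtures.
val mapperInBytes   = 10L * 1024 * 1024    // small input step
val mapperOutBytes  = 200L * 1024 * 1024   // mapper output "explosion"
val bytesPerReducer = 100L * 1024 * 1024

val ratio    = mapperOutBytes.toDouble / mapperInBytes                   // 20.0, kept as a Double
val reducers = math.ceil(mapperInBytes * ratio / bytesPerReducer).toInt  // ceil(2.0) = 2
```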

@isnotinvain (Contributor Author)

I'm adding a cap as well @gerashegalov

isnotinvain changed the title from "[WIP] Initial pass at Issue 1541" to "Fix Rounding Bug in RatioBasedEstimator" on Apr 1, 2016
val maxEstimatedReducersKey = "scalding.reducer.estimator.max.estimated.reducers"

// TODO: what's a reasonable default? Int.maxValue? 5k? 100k?
val defaultMaxEstimatedReducers = 100 * 1000

Collaborator:
100K seems way too many, no? Pig's max is 999.


isnotinvain (Contributor Author):
Yeah I have no idea. I guess I can spot check some jobs in our hadoop cluster.
But if you can have 30k mappers, why not 30k reducers? And if 30k is somewhat normal, then 100k isn't that far off in terms of being "way too much".

I don't know whether Pig's max of 999 was chosen with much thought, or what size of cluster it was chosen for.


Collaborator:
One reason I can think of: too many reducers means too many files.

Yeah, I don't know the historical reason for the 999-reducer limit. I think Hive has the same limit too.
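
Whichever default is chosen, the cap itself is read from the configuration key in the diff above. A sketch of how that read might look, assuming the step strategy's Hadoop JobConf is in scope (the object and helper names are illustrative):

```scala
import org.apache.hadoop.mapred.JobConf

object MaxReducersSketch {
  // Key and default as in the diff above.
  val maxEstimatedReducersKey = "scalding.reducer.estimator.max.estimated.reducers"
  val defaultMaxEstimatedReducers = 100 * 1000

  // Read the configured cap, falling back to the default.
  def configuredMaxReducers(conf: JobConf): Int =
    conf.getInt(maxEstimatedReducersKey, defaultMaxEstimatedReducers)
}
```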

@isnotinvain (Contributor Author)

I can make the max non-fatal, though I'm not entirely convinced.

Having the max act as a cap means that it's another tuning dimension that users need to worry about, that might be causing them to have an unexpected number of reducers (if the cap is set too low).

On the other hand, if we set the cap relatively high, at a value where we think "anything that estimates needing this is clearly broken", and treat exceeding it as exceptional, this becomes a faster feedback mechanism for users. I actually think that in many cases fail-fast is a much better user experience than "just try to make it work and hope". The grey area is whether this is one of those cases. If the estimator estimates needing a ton of reducers, either they really are needed, in which case capping probably won't go well, or the estimator is broken.

That's my pitch, but I don't feel too strongly about it if everyone else wants the cap to not be exceptional.

@johnynek (Collaborator)

johnynek commented Apr 1, 2016

Thinking as a user, I would not want scalding (at least by default) to fail to run if it uses too many reducers.

If I set N as the max, and it runs slow, I should have alerts that I notice, and I can rewrite my job or turn up the max. If it runs too fast because of a bug (and I waste resources) I should have a system to watch that too, and if not, set a lower max so the waste is not that bad.

I have a hard time imagining making the default to fail the job.

I can see adding an option to fail if we exceed the max (since we expect a bug in that case, or a totally giant input), and some users may want it, but I would err on the side of running the job and not changing default behavior.

@isnotinvain (Contributor Author)

I'll update to make it non-fatal. I will also add a property to the hadoop config so that tools that monitor hadoop config properties can warn users that the max has been applied.
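
A sketch of what that could look like; the marker property name below is hypothetical, not necessarily the one the PR adds.

```scala
import org.apache.hadoop.mapred.JobConf

object CapMarkerSketch {
  // Hypothetical key, for illustration only.
  val capAppliedKey = "scalding.reducer.estimator.max.applied"

  // If the estimate exceeds the cap, note that in the JobConf (so external
  // monitoring tools can surface it) and return the capped value.
  def applyCap(conf: JobConf, estimated: Int, configuredMax: Int): Int =
    if (estimated > configuredMax) {
      conf.setBoolean(capAppliedKey, true)
      configuredMax
    } else {
      estimated
    }
}
```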

@isnotinvain (Contributor Author)

OK, this should be good to go. LMK if 5k is not a good default for the cap.

@@ -179,14 +195,27 @@ object ReducerEstimatorStepStrategy extends FlowStepStrategy[JobConf] {
val info = FlowStrategyInfo(flow, preds.asScala, step)

// if still None, make it '-1' to make it simpler to log

Collaborator:
Is this comment still valid? Maybe move it down to just before you log it?


isnotinvain (Contributor Author):
I guess this could be moved to below where it becomes -1.

@sriramkrishnan (Collaborator)

LGTM - may want @rubanm to take a look at it too.

""".stripMargin)
}

n.min(configuredMax)

Contributor:
This could go in an if-else block above?
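
A sketch of that suggestion, with the warning threaded in as a parameter so the snippet stands alone; the message wording is illustrative.

```scala
object CapIfElseSketch {
  // Sketch: fold the cap into an if-else instead of a trailing n.min(configuredMax),
  // so warning about the cap and applying it happen in one place.
  def capEstimate(n: Int, configuredMax: Int, warn: String => Unit): Int =
    if (n > configuredMax) {
      warn(s"Estimated reducers ($n) exceed the configured maximum ($configuredMax); using $configuredMax.")
      configuredMax
    } else {
      n
    }
}
```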

@rubanm (Contributor)

rubanm commented Apr 4, 2016

One minor comment, LGTM!

@isnotinvain (Contributor Author)

I will merge once the tests run

isnotinvain merged commit 315f9c0 into develop on Apr 4, 2016.