hRaven Reducer Estimator #996

bholt · 2014-08-04T18:03:22Z

Adds a new "scalding-hraven" module with an estimator that looks up job history in hRaven to determine how many bytes actually reach the reducers. The HRavenHistory trait can be mixed into a ReducerEstimator to add job-history lookup; RatioBasedReducerEstimator does this, scaling the estimate provided by InputSizeReducerEstimator by the past ratio between mapper and reducer input sizes.
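A minimal sketch of the ratio-based scaling described above, not the code in this PR. `PastRun` and `scaledEstimate` are hypothetical names, and the sketch assumes the ratio is reducer input bytes over mapper input bytes, averaged across past runs:

```scala
object RatioSketch {
  // Hypothetical stand-in for one past run of the same flow step.
  final case class PastRun(mapperBytes: Long, reducerBytes: Long)

  /** Scale an input-size-based estimate by the average reducer/mapper byte ratio. */
  def scaledEstimate(baseEstimate: Int, history: Seq[PastRun]): Option[Int] = {
    // Skip runs with no mapper input to avoid dividing by zero.
    val ratios = history.collect {
      case PastRun(m, r) if m > 0 => r.toDouble / m
    }
    if (ratios.isEmpty) None // no usable history: leave the decision to another estimator
    else {
      val avg = ratios.sum / ratios.size
      // Never go below one reducer.
      Some(math.max(1, math.ceil(baseEstimate * avg).toInt))
    }
  }
}
```

Returning `None` when there is no usable history keeps the sketch compatible with trying another estimator afterwards.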

This PR also includes a couple of other minor changes:

- Use a case class as the argument to estimateReducers to hold the common reducer estimator info (flow, step, predecessorSteps)
- Add an optional fallbackEstimator, which lets estimators be chained
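The two changes above can be sketched roughly as follows. The types are simplified, hypothetical stand-ins (the real info object carries flow, step and predecessorSteps rather than a byte count), and the one-reducer-per-GiB rule is only an assumption for illustration:

```scala
object ChainSketch {
  // Hypothetical, simplified argument object for the estimator.
  final case class EstimatorInfo(inputBytes: Long)

  trait Estimator { self =>
    def estimate(info: EstimatorInfo): Option[Int]

    // Chain: use this estimator's answer if it has one, otherwise ask the fallback.
    def orElse(fallback: Estimator): Estimator = new Estimator {
      def estimate(info: EstimatorInfo): Option[Int] =
        self.estimate(info).orElse(fallback.estimate(info))
    }
  }

  // An estimator that never has an answer, e.g. because no history was found.
  val noHistory: Estimator = new Estimator {
    def estimate(info: EstimatorInfo): Option[Int] = None
  }

  // Assumed rule for illustration: one reducer per GiB of input, at least one.
  val bySize: Estimator = new Estimator {
    def estimate(info: EstimatorInfo): Option[Int] =
      Some(math.max(1, (info.inputBytes / (1L << 30)).toInt))
  }
}
```

With these definitions, `noHistory.orElse(bySize)` falls through to the size-based answer whenever the first estimator returns `None`.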

So far I don't have a story for unit testing hRaven. Any ideas are welcome.

johnynek · 2014-08-04T18:25:39Z

project/Build.scala

@@ -256,6 +257,18 @@ object ScaldingBuild extends Build {
      "org.scala-tools.testing" %% "specs" % "1.6.9" % "test"
    )
  ).dependsOn(scaldingCore)
+
+  lazy val scaldingHRaven = module("hraven").settings(


You probably need to update the .travis.yml now, since all the tests run in separate test jobs (because they got too large for Travis to deal with). Chat with @ianoc.

johnynek · 2014-08-04T18:30:23Z

...-core/src/main/scala/com/twitter/scalding/reducer_estimation/InputSizeReducerEstimator.scala

+  protected def totalSize(taps: Iterator[Tap[_, _, _]], conf: JobConf): Option[Long] =
+    taps.foldLeft(Option(0L)) {
+      // recursive case
+      case (Some(total), multi: MultiSourceTap[Tap[_, _, _], _, _]) =>


good catch. I guess we forgot this case.

Yeah, this worried me — first analytics batch job I ran exposed this. Are there any other cases you can think of that don't implement Hfs that may come up?

None that I know of.
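For readers without the full diff, the shape of the recursive fold discussed here can be sketched with stand-in types. `Source`, `FileSource`, `MultiSource` and `UnknownSource` are hypothetical substitutes for the Cascading tap classes, not the actual API:

```scala
object TotalSizeSketch {
  // Hypothetical stand-ins for Cascading taps.
  sealed trait Source
  final case class FileSource(bytes: Long) extends Source          // size is known
  final case class MultiSource(children: List[Source]) extends Source // wraps other sources
  case object UnknownSource extends Source                          // size cannot be determined

  /** Sum the sizes of all sources, or None if any size is unknown. */
  def totalSize(sources: Iterator[Source]): Option[Long] =
    sources.foldLeft(Option(0L)) {
      // recursive case: add up the wrapped sources first
      case (Some(total), MultiSource(children)) =>
        totalSize(children.iterator).map(total + _)
      // base case: a source whose size is known
      case (Some(total), FileSource(bytes)) =>
        Some(total + bytes)
      // an unknown source, or an earlier failure, makes the whole result unknown
      case _ =>
        None
    }
}
```

Once the accumulator becomes `None` it stays `None`, so a single source of unknown size disables the size-based estimate instead of silently undercounting.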


johnynek · 2014-08-05T00:31:13Z

...ing-hraven/src/main/scala/com/twitter/scalding/hraven/reducer_estimation/HRavenHistory.scala

+  import HRavenHistory.jobConfToRichConfig
+
+  private final val apiHostnameKey = "hraven.api.hostname"
+  private final val hRavenClientConnectTimeout = 30000


these could be defaults that also could be configured in the conf, right?
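One way to read this suggestion, sketched against a plain `Map[String, String]` rather than a Hadoop JobConf. The `hraven.api.hostname` key and the 30000 default come from the diff; the timeout key name is a made-up placeholder:

```scala
import scala.util.Try

object HRavenConfSketch {
  val apiHostnameKey = "hraven.api.hostname"                     // from the diff
  val connectTimeoutKey = "example.hraven.client.connect.timeout" // hypothetical key name
  val defaultConnectTimeout = 30000                               // default from the diff

  /** Read the timeout from the conf, falling back to the default if unset or malformed. */
  def connectTimeout(conf: Map[String, String]): Int =
    conf.get(connectTimeoutKey)
      .flatMap(s => Try(s.trim.toInt).toOption)
      .getOrElse(defaultConnectTimeout)

  /** The hostname has no sensible default, so it stays optional. */
  def apiHostname(conf: Map[String, String]): Option[String] =
    conf.get(apiHostnameKey).filter(_.nonEmpty)
}
```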


johnynek · 2014-08-07T01:09:06Z

scalding-core/src/main/scala/com/twitter/scalding/reducer_estimation/Common.scala

-    val numReducers = estimateReducers(flow, predecessorSteps, flowStep).getOrElse(0)
+    val estimators = Option(conf.get(Config.ReducerEstimators))
+      .map(_.split(",")).flatten
+      .map(Class.forName(_).newInstance.asInstanceOf[ReducerEstimator])


Can you use the forName with getCurrentContext? That seems to be the correct thing (or at least what cascading and hadoop do in several places) to get the classloader for the current thread.
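A sketch of what this might look like, assuming the comment refers to the three-argument `Class.forName` together with `Thread.currentThread.getContextClassLoader`. The helper name `instantiate` is hypothetical:

```scala
object ClassLoadingSketch {
  /** Load a class by name through the current thread's context classloader
    * and create an instance with its no-argument constructor. */
  def instantiate[T](className: String): T = {
    val loader = Thread.currentThread.getContextClassLoader
    Class.forName(className, true, loader) // true = run static initializers
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[T] // unchecked: the caller must name a compatible class
  }
}
```

For example, `ClassLoadingSketch.instantiate[java.util.ArrayList[String]]("java.util.ArrayList")` returns an empty list. A class name that cannot be found throws `ClassNotFoundException`, so callers may want to wrap the call in a `Try`.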


johnynek · 2014-08-07T22:27:46Z

scalding-core/src/main/scala/com/twitter/scalding/reducer_estimation/Common.scala

+ * @param mapperBytes   Input to mappers (in bytes)
+ * @param reducerBytes  Input to reducers (in bytes)
+ */
+case class FlowStepHistory(mapperBytes: Long, reducerBytes: Long)


Should this also have a RichDate for when that job ran?

One idea would be to use a sealed trait here so we can add more to it later without breaking the code. We can still have some apply methods that create these so FlowStepHistory(10L, 1L) would still work.

Cool, I didn't know what sealed traits did. Seems like a plan.

So far, I haven't added RichDate. Is it a big deal for someone to add it whenever they need it?
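The sealed-trait idea can be sketched like this. It is only an illustration of the pattern, not the code that was merged; the private `Basic` class is a hypothetical name:

```scala
object FlowStepHistorySketch {
  sealed trait FlowStepHistory {
    /** Input to mappers (in bytes) */
    def mapperBytes: Long
    /** Input to reducers (in bytes) */
    def reducerBytes: Long
  }

  object FlowStepHistory {
    // Hidden implementation; more fields or variants can be added later
    // without changing how callers construct a history.
    private final case class Basic(mapperBytes: Long, reducerBytes: Long)
      extends FlowStepHistory

    def apply(mapperBytes: Long, reducerBytes: Long): FlowStepHistory =
      Basic(mapperBytes, reducerBytes)
  }
}
```

Callers still write `FlowStepHistory(10L, 1L)`, and a later field such as a run date could be introduced through a new `apply` overload without breaking them.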


johnynek · 2014-08-08T22:13:20Z

looks good except for one minor point.

bholt · 2014-08-08T22:21:40Z

I'm currently testing this in Science, and it looks like I'll have an additional change or two, so don't merge yet.

johnynek · 2014-08-08T23:18:32Z

scalding-core/src/main/scala/com/twitter/scalding/Config.scala

+   * Prepend an estimator so it will be tried first. If it returns None,
+   * the previously-set estimators will be tried in order.
+   */
+  def addReducerEstimator[T](clsName: String): Config =


just noticed this method does not really depend on T. Scalac should probably warn or error on unused type parameters.
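A sketch of the same method without the unused type parameter, using a simplified `Conf` wrapper around a `Map` instead of the real `Config` class. The key string is a placeholder, and storing the list as a comma-separated value is an assumption based on the `split(",")` in the earlier hunk:

```scala
object EstimatorConfigSketch {
  val ReducerEstimatorsKey = "example.reducer.estimators" // hypothetical key string

  // Simplified stand-in for the real Config class.
  final case class Conf(values: Map[String, String]) {
    /** Prepend an estimator so it will be tried first. */
    def addReducerEstimator(clsName: String): Conf = {
      val updated = values.get(ReducerEstimatorsKey) match {
        case Some(existing) if existing.nonEmpty => clsName + "," + existing
        case _                                   => clsName
      }
      Conf(values + (ReducerEstimatorsKey -> updated))
    }
  }
}
```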


johnynek · 2014-08-11T18:23:41Z

If this is ready, it looks good to me. Brandon, if you are ready to pull the trigger, go for it.

I think we may be getting close to scalding 0.12.0 here.


bholt · 2014-08-11T18:29:54Z

Thanks for all the feedback and suggestions, @johnynek.

Brandon Holt added 3 commits August 1, 2014 17:29

add scalding-hraven module, hRaven-based reducer estimation

46370cf

add missing hbase dependency

0f0c3f2

don't double-define Config params, clean up imports, comments

a2f0799

bholt changed the title ~~hRaven Reducer Estimator and other minor estimator fixes~~ hRaven Reducer Estimator Aug 4, 2014

johnynek reviewed Aug 4, 2014

Merge branch 'develop' into bholt/hraven-reducer-estimator

213f729

johnynek reviewed Aug 4, 2014

Brandon Holt added 2 commits August 4, 2014 11:46

run hraven tests on travis (if we end up having any)

9f5b152

get api hostname from config, other minor changes from feedback

a0dd600

johnynek reviewed Aug 5, 2014

Brandon Holt added 5 commits August 5, 2014 18:50

refactor into generic HistoryService/RatioBasedReducer and HRavenHistoryService

77a2486

add hRaven client config params, some comments

fc9a2b5

make 'fetchPastJobDetails' return 'Try'

3cd483d

Merge branch 'develop' into bholt/hraven-reducer-estimator

e96ec7e

make ReducerEstimator a monoid

40d502f

johnynek reviewed Aug 7, 2014

Brandon Holt added 2 commits August 7, 2014 11:23

use thread classloader, remove fallback estimator, conditionally set flow step strategy

84c0e10

use 'Config.addReducerEstimator' in test

366fc91

Brandon Holt added 3 commits August 7, 2014 12:06

catch possible IOException when fetching flows

e897398

fix compile error in estimator instantiation

af8a740

allow fetching multiple history entries

8cd9c53

johnynek reviewed Aug 7, 2014

Brandon Holt added 4 commits August 8, 2014 09:36

move RatioBasedEstimator to its own file and make it average multiple histories

1c95305

make threshold a config parameter, make maxHistory a shared config parameter

3bb68b8

make FlowStepHistory a sealed trait so it can be extended in the future

f58a327

wrap everything in trys

deb3867

use Seq instead of varargs in MissingFieldException

8d2e2d0

johnynek reviewed Aug 8, 2014

Brandon Holt added 4 commits August 8, 2014 17:12

minor fixes, most notably: check if hadoop mode before adding reducer estimator strategy

6d044a8

automatically add username to config

f5e80e6

fix how ExecutionContext handles no-estimator case

7e8b709

better to just match on the option in the first place

2adfee8

bholt added a commit that referenced this pull request Aug 11, 2014

Merge pull request #996 from twitter/bholt/hraven-reducer-estimator

6b64e3e

hRaven Reducer Estimator and a bunch of other reducer_estimation refactoring

bholt merged commit 6b64e3e into develop Aug 11, 2014

bholt deleted the bholt/hraven-reducer-estimator branch August 11, 2014 18:29
