CascadeJob #75

Closed
wants to merge 3 commits

2 participants

@PlasticLizard

This is a very simple kind of Job for aggregating other Scalding jobs into a Cascade. The main motivation was the desire to run parallel Scalding jobs in a single Amazon EMR job flow; based on discussions on the list, this was the recommended approach.

Testing note: I could not figure out how to test the "run" method because the JobTest class doesn't seem to call "run", so I am happy to add a test for this, but I will need some help. However, I have tested this class in practice, both in local mode and on Amazon EMR, and it works fine with real jobs.

Finally, there are two ways to use the new class: by inheriting from it, or via args:

class MyAggregateJob(args : Args) extends CascadeJob(args) { addJobs(new JobA(args), new JobB(args)) }

or

scald.rb com.twitter.scalding.CascadeJob --hdfs --jobs JobA JobB

Thanks,

Nathan

src/main/scala/com/twitter/scalding/Job.scala
@@ -200,3 +202,41 @@ class ScriptJob(cmds: Iterable[String]) extends Job(Args("")) {
}
}
}
+
+/*
+* Composes the provided Jobs into a Cascade
+ (http://docs.cascading.org/cascading/2.0/javadoc/cascading/cascade/Cascade.html)
+*/
+class CascadeJob(args : Args) extends Job(args) {
+
+ private var jobList : List[Job] = List()
+ def jobs = jobList
+
+ /*This job type can be used to aggregate jobs
+ specified using the --jobs command line parameter,
+ for example:
+ scald.rb CascadeJob --hdfs --jobs JobOne JobTwo --outdir some/path
@johnynek (Collaborator) added a note:

Could you add some code to do something like:

--JobOne:outdir j1out --JobTwo:outdir j2out

So any arg prefixed with "jobname:" would be passed (with the prefix removed) only to that particular job? This is useful because there could be collisions in the names of the args (for instance, output or outdir is probably going to be common). This gives some namespacing.

so, something like:

def argsFor(jobname : String, allArgs : Args) : Args = {
  // go over the map of arg names and keep only the ones without ":" in the
  // name, or the ones that start with "jobname:" (with "jobname:" cut off)
}

The Args class does not expose its internal hash. So I can either subclass Args and have a CascadeJobArgs, or I can modify the Args class to expose its internal map. Or I could add an argsFor(prefix : String) method to Args that uses the above logic, basically returning the subset of the Args map that is either not prefixed at all or is prefixed with the provided prefix. What is your preference?
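
For illustration, the third option might look something like this. This is a sketch only, and it assumes Args exposes its underlying Map[String, List[String]] (toMap here is hypothetical) and accepts such a map in its constructor:

def argsFor(prefix : String, allArgs : Args) : Args = {
  val jobPrefix = prefix + ":"
  // Keep args namespaced to this job (with the prefix stripped), keep
  // unprefixed args as-is, and drop args namespaced to other jobs.
  new Args(allArgs.toMap.collect {
    case (k, v) if k.startsWith(jobPrefix) => (k.stripPrefix(jobPrefix), v)
    case (k, v) if !k.contains(":") => (k, v)
  })
}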

src/main/scala/com/twitter/scalding/Job.scala
((4 lines not shown))
+
+/*
+* Composes the provided Jobs into a Cascade
+ (http://docs.cascading.org/cascading/2.0/javadoc/cascading/cascade/Cascade.html)
+*/
+class CascadeJob(args : Args) extends Job(args) {
+
+ private var jobList : List[Job] = List()
+ def jobs = jobList
+
+ /*This job type can be used to aggregate jobs
+ specified using the --jobs command line parameter,
+ for example:
+ scald.rb CascadeJob --hdfs --jobs JobOne JobTwo --outdir some/path
+ */
+ args.list("jobs").foreach( addJob( _, args ) )
@johnynek (Collaborator) added a note:

couldn't we instead do:

val jobList = args.list("jobs").map { jobname => Job(jobname, argsFor(jobname, args)) }

Then I think you can make jobList immutable and public, and you can also remove the other methods below. What do you think?

src/test/scala/com/twitter/scalding/CascadeJobTest.scala
((9 lines not shown))
+}
+
+class CascadeJobTest extends Specification with TupleConversions {
+ "A CascadeJob" should {
+ val newJobs = Array(new Job1(Args("")), new Job2(Args("")))
+
+ "add jobs via addJobs (via addJob)" in {
+ val cascadeJob = new CascadeJob(Args("")){
+ addJobs(newJobs:_*)
+ }
+ cascadeJob.jobs must be_==(newJobs.toList.reverse)
+ }
+
+ "add jobs via reflective addJob" in {
+ val cascadeJob = new CascadeJob(Args("")){
+ addJob("com.twitter.scalding.Job1", Args("--hi ho"))
@johnynek (Collaborator) added a note:

I think we could do this instead with a CascadeJob companion object so you don't need to make CascadeJob mutable.

So:

object CascadeJob {
  def apply(jobs : (String, Args)*) : CascadeJob = ...
}

Then: CascadeJob(("job1", Args("--test1")), ("job2", Args("--test2")))

What do you think?

src/test/scala/com/twitter/scalding/CascadeJobTest.scala
@@ -0,0 +1,35 @@
+package com.twitter.scalding
+
+import org.specs._
+
+class Job1(args : Args) extends Job(args) {
+}
+
+class Job2(args : Args) extends Job(args) {
+}
+
+class CascadeJobTest extends Specification with TupleConversions {
@johnynek (Collaborator) added a note:

It would be nice, as you mentioned, to actually run a test job (one that we already have, for instance).

You are right, you will need to change this line:

https://github.com/twitter/scalding/blob/master/src/main/scala/com/twitter/scalding/JobTest.scala#L103

to call .run

I don't know why it doesn't now. I can't think of a reason for that. If you do that, I think this will work properly.

@PlasticLizard

That all seems reasonable, I'll make the suggested changes.

@PlasticLizard

After using this a bit more extensively, I noticed a few edge cases that made me wonder if this is something you want in your DSL, or if it would be better presented as a wiki page.

First, Scalding calls a function "validateSources" when preparing a job flow. If a cascade has been built with jobs that depend on one another, the sources may not be valid at job flow creation time. So this method has to be disabled when a job is run in a cascade. This could be accomplished with a public property "isInCascade" that gets set by CascadeJob and is used by child jobs to determine if they should validate.
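
As an illustration, here is a minimal, self-contained sketch of that idea; isInCascade and doValidateSources are hypothetical stand-ins, not the real Job API:

trait ValidatingJob {
  // Set by the cascade on each child job before its flow is built.
  var isInCascade = false

  protected def doValidateSources() : Unit

  def validateSources() {
    // Intermediate sources may not exist until upstream flows in the
    // cascade have run, so skip validation when Cascading will sequence
    // the flows itself.
    if (!isInCascade) doValidateSources()
  }
}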

Second, when a job is executed by Cascading as part of a cascade, the "next" method obviously doesn't get called. This makes sense once you think about it, but I think it would be surprising to people. This seems like a bigger problem to me.

Finally, and this is probably a minor point, I can't make a test pass using the CascadeJob to save my life. I get errors from Cascading about creating cycles and loops that I don't have the expertise (or time) to track down.

Let me know what you think. I'm happy to deal with the validation issue and continue with the pull request, or I'm happy to write a quick wiki page on how to execute jobs in a Cascade using user level code, whatever makes the most sense for your project.

Commits on May 5, 2012

  1. @PlasticLizard: Add CascadeJob
  2. @PlasticLizard

Commits on May 7, 2012

  1. @PlasticLizard: rework
src/main/scala/com/twitter/scalding/Job.scala
@@ -17,6 +17,8 @@ package com.twitter.scalding
import cascading.flow.{Flow, FlowDef, FlowProps}
import cascading.pipe.Pipe
+import cascading.cascade.{Cascade, CascadeConnector}
+import cascading.stats.FlowStats
//For java -> scala implicits on collections
@@ -200,3 +202,40 @@ class ScriptJob(cmds: Iterable[String]) extends Job(Args("")) {
}
}
}
+
+object CascadeJob {
+ def apply( jobList : Job* ) = new CascadeJob { override val jobs = jobList }
+}
+
+/*
+* Composes the provided Jobs into a Cascade
+ (http://docs.cascading.org/cascading/2.0/javadoc/cascading/cascade/Cascade.html)
+*/
+class CascadeJob(args : Args = Args("")) extends Job(args) {
+
+ /*This job type can be used to aggregate jobs
+ specified using the --jobs command line parameter,
+ for example:
+
+ scald.rb CascadeJob --hdfs --jobs JobOne JobTwo --outdir some/path
+
+ To pass a job-specific argument, prefix the argument name with
+ the job name:
+
+ scald.rb CascadeJob --hdfs --jobs JobOne JobTwo --JobOne:input j1i.csv --JobTwo:input j2i.csv
+ */
+ val jobs : Iterable[Job] = args.list("jobs").map { jobname => Job(jobname, argsFor(jobname, args)) }
+
+ override def run(implicit mode : Mode) = {
+ val flows = jobs.map( _.buildFlow(mode) )
+ val cascade = new CascadeConnector().connect ( flows.toSeq:_* )
+ cascade.complete()
+ cascade.getCascadeStats.getChildren.toSeq.forall {
+ _.asInstanceOf[FlowStats].isSuccessful
+ }
+ }
+
+ def argsFor(jobName : String, parentArgs : Args) = {
+ parentArgs
+ }
+}
src/main/scala/com/twitter/scalding/JobTest.scala
@@ -100,7 +100,8 @@ class JobTest(jobName : String) extends TupleConversions {
@tailrec
private final def runJob(job : Job, runNext : Boolean) : Unit = {
- job.buildFlow.complete
+ //job.buildFlow.complete
+ job.run
val next : Option[Job] = if (runNext) { job.next } else { None }
next match {
case Some(nextjob) => runJob(nextjob, runNext)
src/main/scala/com/twitter/scalding/MemoryTap.scala
@@ -30,7 +30,7 @@ class MemoryTap[In,Out](val scheme : Scheme[Properties,In,Out,_,_], val tupleBuf
override def deleteResource(conf : Properties) = true
override def resourceExists(conf : Properties) = true
override def getModifiedTime(conf : Properties) = 1L
- override def getIdentifier() : String = scala.math.random.toString
+ override def getIdentifier() : String = { val id = scala.math.random.toString; println("I am " + id); id}
override def openForRead(flowProcess : FlowProcess[Properties], input : In) = {
new TupleEntryChainIterator(scheme.getSourceFields, tupleBuffer.toIterator)
src/test/scala/com/twitter/scalding/CascadeJobTest.scala
@@ -0,0 +1,37 @@
+package com.twitter.scalding
+
+import org.specs._
+
+class Job1(args : Args) extends Job(args) {
+ Tsv("in1").read.write(Tsv("output1"))
+}
+
+class Job2(args : Args) extends Job(args) {
+ Tsv("in2").read.write(Tsv("output2"))
+}
+
+class CascadeJobTest extends Specification with TupleConversions {
+ "Instantiating a CascadeJob" should {
+ val newJobs = List(new Job1(Args("")), new Job2(Args("")))
+
+ "add jobs via arguments" in {
+ val cascadeJob = new CascadeJob(Args("--jobs com.twitter.scalding.Job1 com.twitter.scalding.Job2"))
+ cascadeJob.jobs.map(_.name) must be_==(List("com.twitter.scalding.Job1","com.twitter.scalding.Job2"))
+ }
+
+ "add jobs via factory method" in {
+ val cascadeJob = CascadeJob(newJobs:_*)
+ cascadeJob.jobs.map(_.name) must be_==(List("com.twitter.scalding.Job1","com.twitter.scalding.Job2"))
+ }
+ }
+
+ // "Running a CascadeJob" should {
+
+ // JobTest("com.twitter.scalding.CascadeJob")
+ // .arg("jobs", List("com.twitter.scalding.Job1", "com.twitter.scalding.Job2"))
+ // .run
+ // .finish
+
+
+ // }
+}