ToolJob: support for Jobs with integrated Tool. #242

Open
johnynek opened this issue Dec 11, 2012 · 3 comments
Comments

@johnynek
Collaborator

The normal way to build a scalding job is to write a class that subclasses Job. This class is then rendered as a cascading flow by scalding.Tool. There are two issues with this: 1) reflection is normally used to launch the job, so any error thrown at constructor time is generally hidden from the user, because it surfaces as a reflection failure; 2) for people building stand-alone jobs, this needlessly complicates their build, since they have to launch with a special, redundant string.
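
To illustrate the first issue, here is a minimal, library-free sketch of how a reflective launch wraps a constructor failure. `FailingJob` and `ReflectionLaunchSketch` are hypothetical names used only for this example; this is not Scalding's actual launch code, just the general JVM behaviour it would be subject to.

```scala
import java.lang.reflect.InvocationTargetException

// Hypothetical job class whose constructor throws on bad input.
class FailingJob(args: Array[String]) {
  require(args.contains("--input"), "missing --input")
}

object ReflectionLaunchSketch {
  // Instantiates a class by name, the way a reflective launcher might.
  // Returns the exception exactly as the caller would first see it.
  def launch(className: String, args: Array[String]): Either[Throwable, Any] =
    try {
      val ctor = Class.forName(className).getConstructor(classOf[Array[String]])
      // Cast so the array is passed as one constructor argument.
      Right(ctor.newInstance(args.asInstanceOf[AnyRef]))
    } catch {
      // The constructor's own exception arrives wrapped in this type.
      case e: InvocationTargetException => Left(e)
    }
}
```

Calling `ReflectionLaunchSketch.launch(classOf[FailingJob].getName, Array.empty)` gives back the `InvocationTargetException` wrapper; the original "missing --input" error is only reachable through `getCause`.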

My idea is the ability to do something like:

object MyNewJob extends App with ToolJob {
  // args is the raw input provided by App
  TypedTsv[String](parsedArgs("input"))
    .mapTo('words) { _.split("\\s+") }
    .groupBy('words) { _.size }
    .write(Tsv(parsedArgs("out")))
}

And then be able to run that with `hadoop jar MyJar.jar --input infile --out wordcount.tsv` and have it bake in the default main method correctly.
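
As a rough illustration of what `parsedArgs` would have to do with that command line, here is a library-free sketch that groups `--key value` tokens into a map. `ArgsSketch` is a hypothetical stand-in written for this example; it is not scalding.Args and makes no claim to match its behaviour in edge cases.

```scala
object ArgsSketch {
  // Groups each "--key" with the values that follow it, up to the next "--key".
  // Values that appear before any key are stored under the empty string.
  def parse(raw: Seq[String]): Map[String, List[String]] = {
    val start = (Map.empty[String, List[String]], "", List.empty[String])
    val (acc, key, vals) = raw.foldLeft(start) {
      case ((acc, key, vals), tok) if tok.startsWith("--") =>
        // A new key begins: store what was collected for the previous one.
        (acc + (key -> vals.reverse), tok.drop(2), Nil)
      case ((acc, key, vals), tok) =>
        (acc, key, tok :: vals)
    }
    acc + (key -> vals.reverse)
  }
}
```

With the command line above, `ArgsSketch.parse(Seq("--input", "infile", "--out", "wordcount.tsv"))` maps `"input"` to `List("infile")` and `"out"` to `List("wordcount.tsv")`.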

@johnynek
Collaborator Author

Alternatively, we could implement something equivalent to App that doesn't introduce the args confusion (i.e. args would return a scalding.Args), so you just type object MyJob extends ToolJob.

@avibryant
Contributor

I think I'm missing some context here. What would a ToolJob look like?

@johnynek
Collaborator Author

So, behind the scenes it is going to run the code here:

https://github.com/twitter/scalding/blob/develop/src/main/scala/com/twitter/scalding/Tool.scala

Except it isn't going to instantiate a new job to populate the flowDef; it will do so in the body of the object.
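
One possible shape for this, as a library-free sketch only: a base class that carries the main method and runs the job body with already-parsed arguments, so no class name or reflection is needed at launch. `ToolJobSketch` and `EchoJob` are hypothetical names, and the body is passed in as a function here, which differs from the proposal of writing it directly in the object body (in Scala 2, `App` defers the object body through `DelayedInit`). No flowDef or cascading flow is built in this sketch.

```scala
// Hypothetical base class: not part of Scalding.
abstract class ToolJobSketch(body: Map[String, String] => Unit) {
  // Pairs up "--key value" tokens; a simplification of real argument parsing.
  private def parse(raw: Array[String]): Map[String, String] =
    raw.grouped(2).collect {
      case Array(k, v) if k.startsWith("--") => k.drop(2) -> v
    }.toMap

  // The baked-in entry point: parse first, then run the job body directly.
  def main(raw: Array[String]): Unit = body(parse(raw))
}

// A stand-alone job is then just an object; its name never appears on the command line.
object EchoJob extends ToolJobSketch(args =>
  println(s"would read ${args("input")} and write ${args("out")}")
)
```

Because the body runs as an ordinary function call from `main`, an exception thrown inside it reaches the caller unwrapped.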

Maybe it is not helpful. Just an idea. Not really sure who the customer is, so maybe we can wait for someone to say this is needed.
