Merge pull request #1 from maasg/NamedRDD_documentation
Adds documentation of NamedRDD usage
maasg committed Jun 5, 2014
2 parents b600cd1 + b7bc3cc commit acaf6a1
Showing 1 changed file with 35 additions and 1 deletion.
36 changes: 35 additions & 1 deletion README.md
@@ -13,6 +13,7 @@ We deploy our job server off of this repo at Ooyala and it is tested against CDH
- Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
- Works with Standalone Spark as well as Mesos
- Job and jar info is persisted via a pluggable DAO interface
- Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.

## Quick start / development mode

@@ -106,10 +107,43 @@ In your `build.sbt`, add this to use the job server jar:

    resolvers += "Ooyala Bintray" at "http://dl.bintray.com/ooyala/maven"

    libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"

For most use cases it's better to have the dependencies be "provided" because you don't want SBT assembly to include the whole job server jar.

To create a job that can be submitted through the job server, the job must implement the `SparkJob` trait.
Your job will look like:

    object SampleJob extends SparkJob {
      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
    }

- `runJob` contains the implementation of the job. The `SparkContext` is managed by the job server and is provided to the job through this method. This relieves the developer of the boilerplate configuration management that comes with creating a Spark job and allows the job server to manage and re-use contexts.
- `validate` allows for an initial validation of the context and any provided configuration. If the context and configuration are suitable for running the job, return `spark.jobserver.SparkJobValid` to let the job execute; return `spark.jobserver.SparkJobInvalid(reason)` to prevent the job from running and convey the reason for the failure. This way you avoid running jobs that will eventually fail due to missing or incorrect configuration, saving both time and resources. A minimal example is sketched below.
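
For illustration, here is a minimal sketch of a complete job. The word-count logic and the `input.string` configuration key are assumptions made for this example, not something prescribed by the job server:

    // Minimal sketch of a SparkJob; assumes the input text is passed in the job
    // configuration under a hypothetical "input.string" key.
    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import scala.util.Try
    import spark.jobserver._

    object WordCountExampleJob extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
        Try(config.getString("input.string"))
          .map(_ => SparkJobValid)
          .getOrElse(SparkJobInvalid("No input.string config param"))
      }

      override def runJob(sc: SparkContext, jobConfig: Config): Any = {
        val words = jobConfig.getString("input.string").split(" ").toSeq
        sc.parallelize(words).countByValue()  // word -> count map, returned to the caller
      }
    }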

### Using Named RDDs
Named RDDs are a way to easily share RDDs among jobs. Using this facility, computed RDDs can be cached under a given name and retrieved later.
To use this feature, the job needs to mix in the `NamedRddSupport` trait:

    object SampleNamedRDDJob extends SparkJob with NamedRddSupport {
      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
    }

Then in the implementation of the job, RDDs can be stored with a given name:

    this.namedRdds.update("french_dictionary", frenchDictionaryRDD)

Other jobs running in the same context can then retrieve and use this RDD:

    val rdd = this.namedRdds.get[(String, String)]("french_dictionary").get

(Note the explicit type parameter provided to `get`; it is needed to cast the retrieved RDD, which otherwise has type `RDD[_]`.)
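
Putting it together, here is a sketch of two jobs sharing a named RDD within the same context. The dictionary contents and the object names are made up for illustration; only `namedRdds.update` and `namedRdds.get` come from the job server:

    // Sketch: one job builds and caches an RDD under a name; a second job, run later
    // in the same context, retrieves it. The "french_dictionary" data is illustrative.
    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // implicit conversions for pair RDD operations (lookup)
    import spark.jobserver._

    object DictionaryLoaderJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

      override def runJob(sc: SparkContext, jobConfig: Config): Any = {
        val dict = sc.parallelize(Seq("dog" -> "chien", "cat" -> "chat"))
        this.namedRdds.update("french_dictionary", dict)  // cache the RDD under a name
        dict.count()
      }
    }

    object TranslatorJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

      override def runJob(sc: SparkContext, jobConfig: Config): Any = {
        // The type parameter recovers RDD[(String, String)] from the stored RDD[_]
        val dict = this.namedRdds.get[(String, String)]("french_dictionary").get
        dict.lookup("dog")  // e.g. Seq("chien")
      }
    }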


## Deployment

1. Copy `config/local.sh.template` to `<environment>.sh` and edit as appropriate.
