diff --git a/README.md b/README.md
index 30ed5a9c0..7cc48cf74 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ We deploy our job server off of this repo at Ooyala and it is tested against CDH
 - Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
 - Works with Standalone Spark as well as Mesos
 - Job and jar info is persisted via a pluggable DAO interface
+- Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
 
 ## Quick start / development mode
 
@@ -106,10 +107,43 @@ In your `build.sbt`, add this to use the job server jar:
 
     resolvers += "Ooyala Bintray" at "http://dl.bintray.com/ooyala/maven"
 
-    libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"
+    libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"
 
 For most use cases it's better to have the dependencies be "provided" because you don't want SBT assembly to include the whole job server jar.
 
+To create a job that can be submitted through the job server, the job must implement the `SparkJob` trait.
+Your job will look like:
+
+    object SampleJob extends SparkJob {
+        override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+        override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
+
+- `runJob` contains the implementation of the job. The SparkContext is managed by the job server and is provided to the job through this method.
+  This relieves the developer of the boilerplate configuration management that comes with creating a Spark job and allows the job server to
+  manage and re-use contexts.
+- `validate` allows for an initial validation of the context and of any provided configuration. If the context and configuration are fit for running the job, returning `spark.jobserver.SparkJobValid` lets the job execute, while returning `spark.jobserver.SparkJobInvalid(reason)` prevents the job from running and conveys the reason for the failure. This way you avoid running jobs that would eventually fail due to missing or wrong configuration, saving both time and resources.
+
+### Using Named RDDs
+Named RDDs are a way to easily share RDDs among jobs. Using this facility, computed RDDs can be cached under a given name and retrieved later on.
+To use this feature, the job needs to mix in the `NamedRddSupport` trait:
+
+    object SampleNamedRDDJob extends SparkJob with NamedRddSupport {
+        override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+        override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
+
+Then, in the implementation of the job, RDDs can be stored under a given name:
+
+    this.namedRdds.update("french_dictionary", frenchDictionaryRDD)
+
+Other jobs running in the same context can retrieve and use this RDD later on:
+
+    val rdd = this.namedRdds.get[(String, String)]("french_dictionary").get
+
+(Note the explicit type parameter given to `get`: it casts the retrieved RDD, which otherwise has type `RDD[_]`.)
+
+
 ## Deployment
 
 1. Copy `config/local.sh.template` to `<environment>.sh` and edit as appropriate.
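
As a supplement to the `SparkJob` contract documented in the patch above, a minimal self-contained job might look like the sketch below. The object name `WordCountExampleJob` and the config key `input.string` are assumptions made for illustration and are not defined by the job server; `SparkJob`, `SparkJobValid`, `SparkJobInvalid`, and `SparkJobValidation` are the API pieces described above.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    // Hypothetical job: counts word occurrences in a string passed via the job config.
    object WordCountExampleJob extends SparkJob {

      // Reject the job up front if the expected config key is missing.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("input.string")) SparkJobValid
        else SparkJobInvalid("missing config key input.string")

      // The returned Map is the job result reported back through the job server.
      override def runJob(sc: SparkContext, config: Config): Any = {
        val words = config.getString("input.string").split("\\s+").toSeq
        sc.parallelize(words).countByValue()
      }
    }

Because `validate` runs before `runJob`, a missing `input.string` is reported to the caller immediately instead of surfacing later as a failed Spark job.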
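
The named-RDD workflow described above can be sketched end to end in the same spirit: one job caches an RDD under a name, and a second job in the same context retrieves it instead of recomputing it. The job names, the `dictionary.path` and `word` config keys, and the tab-separated input format are assumptions for this example; `NamedRddSupport`, `namedRdds.update`, and `namedRdds.get` are the pieces documented in the patch.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    // Hypothetical job: reads "word<TAB>translation" lines and caches them as a named pair RDD.
    object DictionaryLoaderJob extends SparkJob with NamedRddSupport {

      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("dictionary.path")) SparkJobValid
        else SparkJobInvalid("missing config key dictionary.path")

      override def runJob(sc: SparkContext, config: Config): Any = {
        val pairs = sc.textFile(config.getString("dictionary.path"))
                      .map(_.split("\t"))
                      .collect { case Array(word, translation) => (word, translation) }
        // Cache the RDD under a name so that other jobs in the same context can reuse it.
        this.namedRdds.update("french_dictionary", pairs)
        pairs.count()
      }
    }

    // Hypothetical job: looks up a word in the previously cached dictionary.
    object DictionaryLookupJob extends SparkJob with NamedRddSupport {

      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("word")) SparkJobValid
        else SparkJobInvalid("missing config key word")

      override def runJob(sc: SparkContext, config: Config): Any = {
        // get[T] returns an Option; the explicit type parameter casts the stored RDD[_].
        this.namedRdds.get[(String, String)]("french_dictionary") match {
          case Some(dictionary) =>
            val word = config.getString("word")
            dictionary.filter { case (w, _) => w == word }.map(_._2).collect().toSeq
          case None => "french_dictionary has not been cached yet"
        }
      }
    }

Both jobs must be submitted to the same long-running context for the shared RDD to be visible; a named RDD cached in one context is not available from another.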