From 07901fd85e56b9667d60c8681eff826a7ae50040 Mon Sep 17 00:00:00 2001
From: Gerard Maas
Date: Tue, 27 May 2014 18:31:24 +0200
Subject: [PATCH 1/3] [WIP] adds info on how to create a job and the named RDD feature

---
 README.md | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 30ed5a9c0..5e8e94fcc 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ We deploy our job server off of this repo at Ooyala and it is tested against CDH
 - Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
 - Works with Standalone Spark as well as Mesos
 - Job and jar info is persisted via a pluggable DAO interface
+- Named RDDs allow jobs to store and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
 
 ## Quick start / development mode
 
@@ -106,10 +107,43 @@ In your `build.sbt`, add this to use the job server jar:
 
     resolvers += "Ooyala Bintray" at "http://dl.bintray.com/ooyala/maven"
 
-    libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"
+    libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"
 
 For most use cases it's better to have the dependencies be "provided" because you don't want SBT assembly to include the whole job server jar.
 
+To create a job that can be submitted through the job server, the job must implement the `SparkJob` trait.
+Your job will look like:
+
+    object SampleJob extends SparkJob {
+      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
+
+`runJob` contains the implementation of the job. The SparkContext is managed by the job server and will be provided to the job through this method.
+This relieves the developer of the boilerplate configuration management that comes with creating a Spark job and allows the job server to
+manage and reuse contexts.
+`validate` allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning `spark.jobserver.SparkJobValid` will let the job execute, whereas returning `spark.jobserver.SparkJobInvalid(reason)` prevents the job from running and conveys the reason for the failure. This way you avoid running jobs that would eventually fail due to missing or wrong configuration, saving both time and resources.
+
+### Using Named RDDs
+Named RDDs are a way to easily share RDDs among jobs. Using this facility, computed RDDs can be cached with a given name and retrieved later on.
+To use this feature, the SparkJob needs to mix in `NamedRddSupport`:
+
+    object SampleNamedRDDJob extends SparkJob with NamedRddSupport {
+      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
+
+Then in the implementation of the job, RDDs can be stored:
+
+    this.namedRdds.update("french_dictionary", frenchDictionaryRDD)
+
+Other jobs can retrieve this RDD later on:
+
+    val rdd = this.namedRdds.get[(String, String)]("french_dictionary").get
+
+(Note the explicit type parameter provided to `get`: it casts the retrieved RDD, which would otherwise be of type `RDD[_]`.)
+
+
 ## Deployment
 
 1. Copy `config/local.sh.template` to `.sh` and edit as appropriate.
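For illustration, a concrete job following the `SparkJob` contract described in the patch above might look like the sketch below. The word-count logic, the `input.string` configuration key, and the object name are illustrative assumptions, not part of the patch; the `Config` parameter is assumed to be the Typesafe `Config` the job server passes in.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    object WordCountJob extends SparkJob {
      // Count the words of a string passed in through the job configuration.
      override def runJob(sc: SparkContext, jobConfig: Config): Any = {
        val words = jobConfig.getString("input.string").split(" ").toSeq
        sc.parallelize(words).countByValue()
      }

      // This sample has no preconditions beyond a live context, so it always validates.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
    }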
From 349b9af8d7318cf852432b180d1548ae6d391808 Mon Sep 17 00:00:00 2001
From: Gerard Maas
Date: Tue, 27 May 2014 18:35:10 +0200
Subject: [PATCH 2/3] Markdown style changes

---
 README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 5e8e94fcc..f51da7dd4 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ We deploy our job server off of this repo at Ooyala and it is tested against CDH
 - Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
 - Works with Standalone Spark as well as Mesos
 - Job and jar info is persisted via a pluggable DAO interface
-- Named RDDs allow jobs to store and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
+- Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
 
 ## Quick start / development mode
 
@@ -114,15 +114,15 @@ For most use cases it's better to have the dependencies be "provided" because you don't want SBT assembly to include the whole job server jar.
 To create a job that can be submitted through the job server, the job must implement the `SparkJob` trait.
 Your job will look like:
 
-    object SampleJob extends SparkJob {
-      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
-      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
-    }
+    object SampleJob extends SparkJob {
+      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
 
-`runJob` contains the implementation of the job. The SparkContext is managed by the job server and will be provided to the job through this method.
-This relieves the developer of the boilerplate configuration management that comes with creating a Spark job and allows the job server to
+- `runJob` contains the implementation of the job. The SparkContext is managed by the job server and will be provided to the job through this method.
+  This relieves the developer of the boilerplate configuration management that comes with creating a Spark job and allows the job server to
 manage and reuse contexts.
-`validate` allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning `spark.jobserver.SparkJobValid` will let the job execute, whereas returning `spark.jobserver.SparkJobInvalid(reason)` prevents the job from running and conveys the reason for the failure. This way you avoid running jobs that would eventually fail due to missing or wrong configuration, saving both time and resources.
+- `validate` allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning `spark.jobserver.SparkJobValid` will let the job execute, whereas returning `spark.jobserver.SparkJobInvalid(reason)` prevents the job from running and conveys the reason for the failure. This way you avoid running jobs that would eventually fail due to missing or wrong configuration, saving both time and resources.
 
 ### Using Named RDDs
 Named RDDs are a way to easily share RDDs among jobs. Using this facility, computed RDDs can be cached with a given name and retrieved later on.
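As a complement to the `validate` bullet above, a validation that actually rejects a misconfigured job could look like the following sketch; the `input.string` key and the error message are assumptions for illustration only.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    import scala.util.Try

    object ValidatedSampleJob extends SparkJob {
      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???

      // Reject the job up front when the expected configuration entry is missing,
      // so no cluster time is spent on a run that would fail anyway.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        Try(config.getString("input.string"))
          .map(_ => SparkJobValid)
          .getOrElse(SparkJobInvalid("missing mandatory config entry: input.string"))
    }

With such a check in place, a submission that lacks `input.string` is rejected immediately with that reason instead of failing later on the cluster.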
From b7bc3cc25d41f3a7a2ac9fd4e2ecfbc486524345 Mon Sep 17 00:00:00 2001
From: Gerard Maas
Date: Thu, 5 Jun 2014 23:00:48 +0200
Subject: [PATCH 3/3] review updated docs with NamedRDDs info

---
 README.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index f51da7dd4..7cc48cf74 100644
--- a/README.md
+++ b/README.md
@@ -114,10 +114,10 @@ For most use cases it's better to have the dependencies be "provided" because you don't want SBT assembly to include the whole job server jar.
 To create a job that can be submitted through the job server, the job must implement the `SparkJob` trait.
 Your job will look like:
 
-    object SampleJob extends SparkJob {
-      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
-      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
-    }
+    object SampleJob extends SparkJob {
+      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
 
 - `runJob` contains the implementation of the job. The SparkContext is managed by the job server and will be provided to the job through this method.
   This relieves the developer of the boilerplate configuration management that comes with creating a Spark job and allows the job server to
@@ -128,18 +128,18 @@ manage and reuse contexts.
 Named RDDs are a way to easily share RDDs among jobs. Using this facility, computed RDDs can be cached with a given name and retrieved later on.
 To use this feature, the SparkJob needs to mix in `NamedRddSupport`:
 
-    object SampleNamedRDDJob extends SparkJob with NamedRddSupport {
-      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
-      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
-    }
+    object SampleNamedRDDJob extends SparkJob with NamedRddSupport {
+      override def runJob(sc: SparkContext, jobConfig: Config): Any = ???
+      override def validate(sc: SparkContext, config: Config): SparkJobValidation = ???
+    }
 
-Then in the implementation of the job, RDDs can be stored:
+Then in the implementation of the job, RDDs can be stored with a given name:
 
-    this.namedRdds.update("french_dictionary", frenchDictionaryRDD)
+    this.namedRdds.update("french_dictionary", frenchDictionaryRDD)
 
-Other jobs can retrieve this RDD later on:
+Other jobs running in the same context can then retrieve and use this RDD:
 
-    val rdd = this.namedRdds.get[(String, String)]("french_dictionary").get
+    val rdd = this.namedRdds.get[(String, String)]("french_dictionary").get
 
 (Note the explicit type parameter provided to `get`: it casts the retrieved RDD, which would otherwise be of type `RDD[_]`.)
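To tie the named-RDD snippets above together, here is a sketch of a producer job and a consumer job sharing a dictionary RDD within the same context; the object names, the sample dictionary entries, and the `word` configuration key are illustrative assumptions.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

    // First job: computes the dictionary RDD and caches it under a well-known name.
    object DictionaryLoaderJob extends SparkJob with NamedRddSupport {
      override def runJob(sc: SparkContext, jobConfig: Config): Any = {
        val frenchDictionaryRDD = sc.parallelize(Seq("hello" -> "bonjour", "thanks" -> "merci"))
        this.namedRdds.update("french_dictionary", frenchDictionaryRDD)
        "french_dictionary cached"
      }
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
    }

    // Second job: submitted later to the same context, it retrieves the dictionary by name.
    object TranslatorJob extends SparkJob with NamedRddSupport {
      override def runJob(sc: SparkContext, jobConfig: Config): Any = {
        val dictionary = this.namedRdds.get[(String, String)]("french_dictionary").get
        val word = jobConfig.getString("word")
        dictionary.filter { case (english, _) => english == word }.map(_._2).collect()
      }
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
    }

Because both jobs mix in `NamedRddSupport` and run in the same long-lived context, the second job reuses the cached RDD by name instead of recomputing it.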