
feat(jobserver): Replace manager_start with SparkLauncher #982

Merged
merged 7 commits on Feb 7, 2018

Conversation

bsikander
Contributor

@bsikander bsikander commented Nov 30, 2017

Idea

  • Move variables from manager_start and server_start to setenv.sh
  • Use set -a in server_start.sh to expose all the variables to the environment on SJS startup.
    These variables are then used to launch the driver JVM.
  • Move variables like the jar file, conf, etc. from manager_start.sh to setenv.sh to make it more
    configurable
  • Replace the spawning of manager_start in AkkaClusterSupervisor with Launcher
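The `set -a` mechanism can be sketched as follows; the file name and variable values are illustrative, not the actual SJS defaults:

```shell
# 'set -a' marks every variable assigned afterwards for export, so a
# sourced setenv.sh needs no explicit 'export' statements.
cat > /tmp/setenv_demo.sh <<'EOF'
MANAGER_JAR_FILE="spark-job-server.jar"
MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d"
EOF

set -a                    # auto-export all subsequently assigned variables
. /tmp/setenv_demo.sh
set +a                    # back to normal assignment semantics

# a child process (in SJS: the launched driver JVM) now sees the variables
sh -c 'echo "$MANAGER_JAR_FILE"'    # prints: spark-job-server.jar
```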

Testing

This change was tested with client and cluster mode (standalone) and seems to work fine.
Can somebody please verify it with Mesos/Yarn?

Pull Request checklist

  • The commit message(s) follow the contribution guidelines?
  • Tests for the changes have been added (for bug fixes / features)?
  • Docs have been added / updated (for bug fixes / features)?

Current behavior:
The manager_start.sh script is used to spawn driver JVMs. The only way to get the current status of a job from Spark is through the History Server, which is not a good approach.

New behavior:
With SparkLauncher, we can use the SparkAppHandle to access the current state (WAITING/KILLED, etc.) of a job.

BREAKING CHANGES
This change works with Spark 2.1.0 and above, due to the redirectOutput and redirectError functions, which were introduced in 2.1.0.

Other information:
If this change gets through, server_start.sh will also be replaced with the launcher, and a follow-up change will introduce access to the current state of a job in Spark.



* Move variables from manager_start and server_start to setenv
* Use set -a in server_start.sh to expose all the variables in env.
These variables are then used to launch driver JVM.
* Move variables like jar file, conf etc to setenv to make it more
configurable
* Replace spawning of manager_start in AkkaClusterSupervisor to
Launcher
* Add abstract class Launcher to hold common variables and functions
* Add concrete implementation for manager with manager specific
properties
* Update documentation
@codecov-io

codecov-io commented Nov 30, 2017

Codecov Report

Merging #982 into master will decrease coverage by 1.99%.
The diff coverage is 5.79%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #982   +/-   ##
=======================================
- Coverage   72.48%   70.49%   -2%     
=======================================
  Files          77       78    +1     
  Lines        2410     2437   +27     
  Branches      128      212   +84     
=======================================
- Hits         1747     1718   -29     
- Misses        663      719   +56
Impacted Files Coverage Δ
...a/spark/jobserver/AkkaClusterSupervisorActor.scala 0% <0%> (ø) ⬆️
...n/scala/spark/jobserver/util/ManagerLauncher.scala 0% <0%> (ø)
...src/main/scala/spark/jobserver/util/Launcher.scala 0% <0%> (ø)
...ain/scala/spark/jobserver/util/SparkJobUtils.scala 97.36% <80%> (-2.64%) ⬇️
...in/scala/spark/jobserver/cassandra/Cassandra.scala 42.85% <0%> (-7.15%) ⬇️
...-api/src/main/scala/spark/jobserver/JobCache.scala

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9efae97...532946d. Read the comment docs.

@bsikander
Contributor Author

@derSascha can you check this PR with YARN?
@velvia can you have a look at this?

@ecandreev
Contributor

I tried it in cluster mode; it works for me.

README.md Outdated
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=$REMOTE_JOBSERVER_DIR/log4j-cluster.properties"
```

- Cluster mode for mesos/yarn
Contributor

I think it's better to have a separate example for each of Mesos and YARN.

Contributor Author

Ok. I will add separate examples.

Contributor

@velvia velvia left a comment

This looks fine for the most part, thanks, and is a great change. What environments have you tested this under?

@bsikander
Contributor Author

I have tested it with

  • local[*]
  • spark cluster with client and cluster mode

Testing on Mesos/Yarn is required.

@bsikander
Contributor Author

@velvia I added separate examples for Yarn/Mesos.

@thoHeinze

@derSascha Any chance you can take a look at this PR?

@derSascha
Contributor

@bsikander thanks for the improvement. I don't have the time right now to test the cluster mode, but I tested the YARN client mode. Some points I changed while testing around / suggestions / ideas:

  1. Move the MANAGER_... variables into the settings.sh file and set some defaults in setenv.sh. Some examples should be added to the templates in config.
  2. Replace these lines with this; otherwise loading of settings.sh fails if the variables are moved.
  3. Move the examples from the main README into the doc/cluster,mesos,yarn docs.
  4. It would be nice to set driver-memory and driver-cores via context settings (see feat(jobserver): Pass cxt config to manager start #988). We can use spark.driver.memory and spark.driver.cores from the context config to set the args if present, and use JOBSERVER_MEMORY as a fallback like already implemented.
  5. We should make sure that the other requirements from feat(jobserver): Pass cxt config to manager start #988 are addressed, because any special manager_start.sh code can't be implemented anymore. @addisonj any suggestions?
  6. Can we use the SparkAppHandle to handle things like a YARN app being killed via YARN?

Thanks for the pull request! Using the launcher seems to be a more stable way than the bash script workaround.

@derSascha
Contributor

What about something like this in the context configs:

spark.submit.args.conf = ["foo=bar", "c=d",...]

that is parsed and added to the arguments of the launcher.

@addisonj
Contributor

hrm... so I am torn on this... I think it is sort of hard to generalize on all the differences between how spark-standalone, yarn, and mesos work and manager_start.sh has been a decent (albeit hacky) way to handle those differences.

Mesos is particularly challenging due to the way it operates.

For example, when you are submitting a driver to be scheduled (i.e. manager_start.sh calling spark-submit), the --master arg is the URI of the MesosClusterDispatcher, but once the JobManager starts, we need to change that master to the Mesos master (same as it would be in Mesos client mode).

Another challenge is args like --files. In YARN, these files are copied from local to be available to each machine dynamically. However, in Mesos, it is assumed that all the URIs passed in --files are already available to each machine. This meant that to get things to function properly, I had to put an nginx in front of SJS to serve my SJS jar and configuration file, and use things like MESOS_SANDBOX and spark.mesos.uris to get files where expected.

There are also a number of options which I would want to dynamically configure beyond just driver-memory and driver-cores. For example, in Spark 2.2.1 a new config was added, spark.mesos.driver.constraints, which is only used when scheduling a driver and which I would like to configure dynamically based on options passed when creating a context. It would be pretty annoying for any new option like that to require a code change.

I think @derSascha is on the right path with spark.submit.args.conf, so long as it could be set (and probably merged with static options) when creating a context, and perhaps could also interpolate environment variables in a sane way.

My proposal would be a new namespace of args: spark.manager

spark.manager would be where all arguments used by the SparkLauncher are sourced. These arguments could either be set statically in the config or at runtime during context creation. For example, spark.manager.master would be the value used for the master argument in the SparkLauncher; it would default to spark.master if not set, as a sane fallback for simpler cases. This would allow me to schedule dynamically into multiple MesosClusterDispatchers (which could be useful for determining which machines I run on), or it could just be set statically. Assuming it isn't too crazy, it might simplify things to allow these values to reference other variables or even interpolate environment variables, for cases like the MESOS_SANDBOX env var which allows me to access spark.mesos.uris downloaded into the running container (but that might not be required).

Another example would be passing --conf options, spark.manager.conf would be a map of k/v pairs to pass as conf options, and these could all be configured statically or merged with runtime options from context creation.

Sorry for the wall of text, but for this to be flexible enough, I think we would need something like above. Assuming we get it right, I think it would be a lot better than manager_start.sh
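A sketch of the namespace proposed above in HOCON (spark.manager and its keys follow this comment's proposal and are not an existing SJS config section; the values are purely illustrative):

```
spark.manager {
  # launcher-facing master; would fall back to spark.master when unset
  master = "mesos://dispatcher-host:7077"

  # k/v pairs passed to the launcher as --conf options, mergeable with
  # runtime options supplied at context creation
  conf {
    "spark.mesos.driver.constraints" = "rack:LIKE:us-east.*"
    # optional environment interpolation, e.g. for MESOS_SANDBOX
    "spark.mesos.uris" = ${?MESOS_SANDBOX}
  }
}
```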

@bsikander
Contributor Author

@derSascha thank you for the comments.

1- Should a new file named settings.sh be added with all the specific settings? Common settings would go to setenv.sh and all others to settings.sh?
2- OK, I will add this
3- OK, I will add this
4- OK I will add these settings driver-memory and driver-cores
5- Regarding spark.submit.args.conf = ["foo=bar", "c=d",...]: in the current PR you can set the key/value pairs in setenv.sh using MANAGER_EXTRA_SPARK_CONFS. Users can specify key/value pairs like

MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d"

In the Launcher, I am already parsing this and adding each key/value pair as a --conf argument.

6- Currently the YARN killed event is handled through a SparkListener. We could also implement it via SparkAppHandle, but I don't see any benefit to using SparkAppHandle.
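The MANAGER_EXTRA_SPARK_CONFS parsing described above (done in Scala inside the Launcher) can be sketched in shell terms like this; the variable format is from the comment, the loop itself is illustrative:

```shell
# Turn MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d" into repeated --conf
# arguments, mirroring what the Launcher does for the submit command.
MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d"

SUBMIT_ARGS=""
old_ifs=$IFS
IFS='|'                                  # split on the '|' separator
for kv in $MANAGER_EXTRA_SPARK_CONFS; do
  SUBMIT_ARGS="$SUBMIT_ARGS --conf $kv"
done
IFS=$old_ifs

echo "spark-submit$SUBMIT_ARGS ..."      # prints: spark-submit --conf foo=bar --conf c=d ...
```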

@derSascha
Contributor

@bsikander

  1. I mean the config/<environment>.sh that gets renamed to settings.sh while packaging and is then sourced here
  2. It would be nice to set these options on a per-context basis, including environment variables that are evaluated at runtime (see the post from @addisonj)
  3. Oops, my fault. Ignore this point.

@addisonj
The --files thing sounds like Spark standalone cluster mode. In the current version, set REMOTE_JOBSERVER_DIR to some base URL in your settings.sh, and manager_start.sh should prefix all files with this URL.

About the config part: I am not a pro in Scala config parsing and evaluation, but this sounds like a very important feature when removing manager_start.sh.

@velvia
Contributor

velvia commented Dec 17, 2017

@addisonj @derSascha so let me summarize what you are saying to make sure I understand...

The Spark arguments, even ones such as spark.master, that the Launcher has to use (or the equivalent spark-submit params) may be different from what the launched process should use (what is actually passed by the SJS ContextManager to each context process as a config); thus we need a separate configuration aside from the simple spark.master stuff in the context config.

Is that right?

I kind of like spark.launcher better than spark.manager, because when I see "manager" I think of the JobManager, which is our (SJS's) wrapper for starting a job in context-per-jvm.

@bsikander
Contributor Author

@velvia @derSascha @addisonj
Let's finalize. I agree with @velvia that spark.launcher is better. Users can set any Spark configuration, like spark.proxy.user, spark.driver.cores, or spark.driver.memory, statically in the config file or during POST /contexts. We combine both static and dynamic configs and use a loop to add all the configs (similar to the passthrough section).

As a side note:
We could even do this change without a spark.launcher section. We could just use the spark section, put all the configs directly in it, and read the static config like

val test = testConfig.getConfig("spark")
                      .withoutPath("context-settings")
                      .withoutPath("contexts")
                      .withoutPath("jobserver")

Let's finish this because other changes are being blocked by this change.

bin/setenv.sh Outdated
MANAGER_EXTRA_SPARK_CONFS=
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=file:$appdir/log4j-server.properties"
SPARK_LAUNCHER_VERBOSE=0
if [ -z "$MANAGER_JAR_FILE" ]; then
Contributor

@velvia velvia Dec 21, 2017

I believe you can do ${MANAGER_JAR_FILE:="$appdir/spark-job-server.jar"}
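The `:=` parameter expansion suggested here assigns a default only when the variable is unset or empty; a minimal sketch (paths illustrative):

```shell
appdir="/opt/sjs"

unset MANAGER_JAR_FILE
# ':=' assigns the default when the variable is unset or empty;
# the leading ':' is a no-op command that just forces the expansion
: "${MANAGER_JAR_FILE:="$appdir/spark-job-server.jar"}"
echo "$MANAGER_JAR_FILE"    # prints: /opt/sjs/spark-job-server.jar

# an already-set value is left untouched
MANAGER_JAR_FILE="/custom/path.jar"
: "${MANAGER_JAR_FILE:="$appdir/spark-job-server.jar"}"
echo "$MANAGER_JAR_FILE"    # prints: /custom/path.jar
```

This would replace the explicit `if [ -z "$MANAGER_JAR_FILE" ]` check with a one-liner.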

### Modifying the `<environment>.sh` script
Replace `MANAGER_*` variables with
```
MANAGER_JAR_FILE="$appdir/spark-job-server.jar"
Contributor

I wonder if it might be good to put these alternatives in setenv.sh as well, after all the Mesos one is there but YARN isn't.

Contributor Author

Good point, but at this point I don't have any variable which tells me the current spark.submit.deployMode. I do have a path to the *.conf file, and somehow within bash I could try to parse it. What do you say?
Also @derSascha any comments?
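One rough way to sniff the deploy mode out of the conf file from bash might be a simple text extraction (a sketch only; a HOCON file with nested blocks or comments would need more careful parsing):

```shell
# Extract spark.submit.deployMode from a flat, properties-style conf file.
# Nested HOCON blocks would defeat this simple grep-based approach.
CONF_FILE=/tmp/demo.conf
cat > "$CONF_FILE" <<'EOF'
spark.master = "spark://host:7077"
spark.submit.deployMode = "cluster"
EOF

DEPLOY_MODE=$(grep 'spark.submit.deployMode' "$CONF_FILE" \
  | cut -d= -f2 | tr -d ' "')
echo "$DEPLOY_MODE"    # prints: cluster
```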

@velvia
Contributor

velvia commented Dec 21, 2017

@bsikander the changes look good to me.

There is currently the contextconfig section for defaults. Why can't we just pull the Spark settings from there for the Launcher? Or is it because we cannot tell which ones need to be preset before calling the launcher? Rather than creating a whole new config section, I'd rather create a "launcher" section for launcher-specific configs. For example

spark.context-settings.launcher.spark.driver.cores = ...

(if that makes sense)

@bsikander
Contributor Author

@velvia
Ok, as far as I know we can add the launcher configs in spark.context-settings. One thing which is not clear: should I keep the passthrough section, which is parsed like this?

Configs defined in the passthrough section are added to sparkConf when the JobManagerActor is already initialized (i.e. the driver JVM has already started); all of these settings could also be added as --conf during spark-submit/SparkLauncher. Should I just remove the passthrough section and add a launcher section which contains all the static configs and can be overridden by POST /contexts?

@velvia
Contributor

velvia commented Dec 21, 2017 via email

@bsikander
Contributor Author

bsikander commented Jan 24, 2018

Sorry for being away for some time; I was on holiday.

So, I have added a new section in the config named spark.context-settings.launcher; users can add Spark properties in this section, and they will be added to spark-submit by SparkLauncher.
e.g.

curl -X POST "localhost:8090/contexts/streaming?context-factory=spark.jobserver.context.StreamingContextFactory&launcher.spark.driver.cores=1&launcher.spark.driver.memory=512m&launcher.spark.master=spark://<IP>:7077"

Note: The JOBSERVER_MEMORY environment variable is no longer used by SparkLauncher to set spark.driver.memory. Instead, spark.driver.memory is picked up from either application.conf or <environment>.conf.

A few things that we need to take care of in the future regarding spark.context-settings. Users can add Spark configurations in the following areas:
1- Directly under spark.context-settings (e.g. spark.context-settings.spark.driver.memory)
2- Under spark.context-settings.passthrough
3- Under spark.context-settings.hadoop
4- Under spark, e.g. spark.master

All the above settings are added to SparkConf like this.

Now, we have a new section for Launcher and properties can be set under spark.context-settings.launcher.

Since sparkConf takes precedence over spark-submit arguments, configurations set in launcher can be overridden at runtime, which can confuse users.
For compatibility purposes we can keep them for now, but in the future we should only have 2 sections for clarity.
1- spark.context-settings.launcher
2- spark.context-settings.conf
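A sketch of that two-section layout in HOCON (section names as proposed above; the concrete settings are illustrative examples only):

```
spark.context-settings {
  # read by SparkLauncher before the driver JVM starts
  launcher {
    spark.driver.memory = 512m
    spark.driver.cores = 1
  }
  # applied to SparkConf after the driver JVM has started
  conf {
    spark.serializer = "org.apache.spark.serializer.KryoSerializer"
  }
}
```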

@bsikander
Contributor Author

@velvia @derSascha @addisonj can you have a look at my new commit?

@bsikander
Contributor Author

Can someone please have a look at this change?

@bsikander
Contributor Author

@velvia I have some other changes that I want to push on top of this one. It would be nice if you could have a look.

@velvia
Contributor

velvia commented Feb 7, 2018

@bsikander the last commit looks fine, sorry I have been busy with work and away on holiday also. Going to merge it.

@velvia velvia merged commit cf5f1a2 into spark-jobserver:master Feb 7, 2018
@bsikander
Contributor Author

Thank you for merging :)

@thoHeinze thoHeinze mentioned this pull request Apr 6, 2018
@permanentstar

permanentstar commented May 14, 2018

I encountered a context start failure when using this feature in YARN cluster mode.
The Spark driver (on a different node than the jobserver) threw an exception saying it "Could not find system configuration file".
Comparing the v0.8.0 and master builds, the master build is missing the conf file and the log4j file.
I didn't change any related env settings in settings.sh.
hdfs dfs -ls /user/hadoop/.sparkStaging/application_1524042746805_0147
Found 4 items
-rw-r--r-- 3 hadoop supergroup 32649 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/spark_conf.zip
-rw-r--r-- 3 hadoop supergroup 4060 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/cluster.conf
-rw-r--r-- 3 hadoop supergroup 620 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/log4j-cluster.properties

-rw-r--r-- 3 hadoop supergroup 40539377 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/spark-job-server.jar
hdfs dfs -ls /user/hadoop/.sparkStaging/application_1524042746805_0134
Found 2 items
-rw-r--r-- 3 hadoop supergroup 32260 2018-05-14 16:37 /user/hadoop/.sparkStaging/application_1524042746805_0134/spark_conf.zip
-rw-r--r-- 3 hadoop supergroup 40991132 2018-05-14 16:37 /user/hadoop/.sparkStaging/application_1524042746805_0134/spark-job-server.jar

Do I need some more configuration?

Update:

I finally modified the code in spark.jobserver.util.ManagerLauncher and the env variables to make it work in YARN cluster mode. Can anyone check whether this has side effects? Thank you.

// Stage the jobserver conf file (and the log4j properties, if configured)
// to the YARN containers via --files
var addFiles: Seq[String] = Seq(getEnvironmentVariable("MANAGER_CONF_FILE"))

val logConf = loggingOpts.split("""\s+""").map(_.split("="))
  .find(arr => arr(0) == "-Dlog4j.configuration")

if (logConf.isDefined) {
  addFiles = addFiles :+ logConf.get(1)
}

launcher.addSparkArg("--files", addFiles.mkString(","))
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=log4j-cluster.properties"
JOBSERVER_CONFIG="cluster.conf"
