
feat(jobserver): Replace manager_start with SparkLauncher #982

Merged
merged 7 commits on Feb 7, 2018

Conversation

bsikander
Contributor

@bsikander bsikander commented Nov 30, 2017

Idea

  • Move variables from manager_start and server_start to setenv.sh
  • Use set -a in server_start.sh to expose all the variables to the environment on SJS startup.
    These variables are then used to launch the driver JVM.
  • Move variables like the jar file, conf, etc. from manager_start.sh to setenv.sh to make it more
    configurable
  • Replace the spawning of manager_start in AkkaClusterSupervisor with Launcher
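The `set -a` mechanism can be sketched as follows; the file name and variable values are illustrative, not the actual SJS defaults:

```shell
# 'set -a' marks every variable assigned afterwards for export, so a
# sourced setenv.sh needs no explicit 'export' statements.
cat > /tmp/setenv_demo.sh <<'EOF'
MANAGER_JAR_FILE="spark-job-server.jar"
MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d"
EOF

set -a                    # auto-export all subsequently assigned variables
. /tmp/setenv_demo.sh
set +a                    # back to normal assignment semantics

# a child process (in SJS: the launched driver JVM) now sees the variables
sh -c 'echo "$MANAGER_JAR_FILE"'    # prints: spark-job-server.jar
```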

Testing

This change was tested with client and cluster mode (standalone) and seems to work fine.
Can somebody please verify it with Mesos/Yarn?

Pull Request checklist

  • The commit message(s) follow the contribution guidelines?
  • Tests for the changes have been added (for bug fixes / features)?
  • Docs have been added / updated (for bug fixes / features)?

Current behavior:
The manager_start.sh script is used to spawn driver JVMs. The only way to get the current status of a job from Spark is through the History Server, which is not a good approach.

New behavior:
With SparkLauncher, we can use the SparkAppHandle to access the current state (WAITING/KILLED, etc.) of a job.

BREAKING CHANGES
This change works with Spark 2.1.0 and above, due to the redirectOutput and redirectError functions, which were introduced in 2.1.0.

Other information:
If this change gets through, server_start.sh will also be replaced with the launcher, and a follow-up change will introduce access to the current state of a job in Spark.



* Move variables from manager_start and server_start to setenv
* Use set -a in server_start.sh to expose all the variables in env.
These variables are then used to launch driver JVM.
* Move variables like jar file, conf etc to setenv to make it more
configurable
* Replace spawning of manager_start in AkkaClusterSupervisor to
Launcher
* Add abstract class Launcher to hold common variables and functions
* Add concrete implementation for manager with manager specific
properties
* Update documentation
@codecov-io

codecov-io commented Nov 30, 2017

Codecov Report

Merging #982 into master will decrease coverage by 1.99%.
The diff coverage is 5.79%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #982   +/-   ##
=======================================
- Coverage   72.48%   70.49%   -2%     
=======================================
  Files          77       78    +1     
  Lines        2410     2437   +27     
  Branches      128      212   +84     
=======================================
- Hits         1747     1718   -29     
- Misses        663      719   +56
Impacted Files Coverage Δ
...a/spark/jobserver/AkkaClusterSupervisorActor.scala 0% <0%> (ø) ⬆️
...n/scala/spark/jobserver/util/ManagerLauncher.scala 0% <0%> (ø)
...src/main/scala/spark/jobserver/util/Launcher.scala 0% <0%> (ø)
...ain/scala/spark/jobserver/util/SparkJobUtils.scala 97.36% <80%> (-2.64%) ⬇️
...in/scala/spark/jobserver/cassandra/Cassandra.scala 42.85% <0%> (-7.15%) ⬇️
...-api/src/main/scala/spark/jobserver/JobCache.scala

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9efae97...532946d. Read the comment docs.

@bsikander
Contributor Author

@derSascha can you check this PR with YARN?
@velvia can you have a look at this?

@ecandreev
Contributor

I tried it in cluster mode; it works for me.

README.md Outdated
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=$REMOTE_JOBSERVER_DIR/log4j-cluster.properties"
```

- Cluster mode for mesos/yarn
Contributor

I think it's better to have a separate example for each of Mesos and YARN.

Contributor Author

Ok. I will add separate examples.

Contributor

@velvia velvia left a comment

This looks fine for the most part, thanks, and is a great change. What environments have you tested this under?

@bsikander
Contributor Author

I have tested it with

  • local[*]
  • spark cluster with client and cluster mode

Testing on Mesos/Yarn is required.

@bsikander
Contributor Author

@velvia I added separate examples for Yarn/Mesos.

@thoHeinze

@derSascha Any chance you can take a look at this PR?

@derSascha
Contributor

@bsikander thanks for the improvement. I don't have the time right now to test the cluster mode, but I tested the YARN client mode. Some points I changed while testing around / suggestions / ideas:

  1. Move the MANAGER_... variables into the settings.sh file and set some defaults in setenv.sh. Some examples should be added to the templates in config.
  2. Replace these lines with this; otherwise loading of settings.sh fails if the variables are moved.
  3. Move the examples from the main README into the doc/cluster,mesos,yarn docs.
  4. It would be nice to set driver-memory and driver-cores via context settings (see feat(jobserver): Pass cxt config to manager start #988). We can use spark.driver.memory and spark.driver.cores from the context config to set the args if present, and use JOBSERVER_MEMORY as a fallback like already implemented.
  5. We should make sure that the other requirements from feat(jobserver): Pass cxt config to manager start #988 are addressed, because any special manager_start.sh code can't be implemented anymore. @addisonj any suggestions?
  6. Can we use the SparkAppHandle to handle things like a YARN app being killed via YARN?

Thanks for the pull request! Using the launcher seems to be a more stable way than the bash script workaround.

@derSascha
Contributor

What about something like this in the context configs:

spark.submit.args.conf = ["foo=bar", "c=d",...]

that is parsed and added to the arguments of the launcher.

@addisonj
Contributor

hrm... so I am torn on this... I think it is sort of hard to generalize on all the differences between how spark-standalone, yarn, and mesos work and manager_start.sh has been a decent (albeit hacky) way to handle those differences.

Mesos is particularly challenging due to the way it operates.

For example, when you are submitting a driver to be scheduled (i.e. manager_start.sh calling spark-submit), the --master arg is the URI of the MesosClusterDispatcher, but once the JobManager starts, we need to change that master to the Mesos master (same as it would be in Mesos client mode).

Another challenge is args like --files. In YARN, these files are copied from local to be available to each machine dynamically. However, in Mesos, it is assumed that all the URIs passed in --files are already available to each machine. This meant that to get things to function properly, I had to put an nginx in front of SJS to serve my SJS jar and configuration file, and use things like MESOS_SANDBOX and spark.mesos.uris to get files where expected.

There are also a number of options which I would want to dynamically configure beyond just driver-memory and driver-cores. For example, in Spark 2.2.1 a new config was added, spark.mesos.driver.constraints, which is only used when scheduling a driver and which I would like to configure dynamically based on options passed when creating a context. It would be pretty annoying for any new option like that to require a code change.

I think @derSascha is on the right path with spark.submit.args.conf, so long as it could be set (and probably merged with static options) when creating a context, and perhaps could also interpolate environment variables in a sane way.

My proposal would be a new namespace of args: spark.manager

spark.manager would be where all arguments used by the SparkLauncher are sourced. These arguments could either be set statically in the config or at runtime during context creation. For example, spark.manager.master would be the value used for the master argument in the SparkLauncher; it would default to spark.master if not set, as a sane fallback for simpler cases. This would allow me to schedule dynamically into multiple MesosClusterDispatchers (which could be useful for determining which machines I run on), or it could just be set statically. Assuming it isn't too crazy, it might simplify things to allow these values to reference other variables or even interpolate environment variables, for cases like the MESOS_SANDBOX env var which allows me to access spark.mesos.uris downloaded into the running container (but that might not be required).

Another example would be passing --conf options, spark.manager.conf would be a map of k/v pairs to pass as conf options, and these could all be configured statically or merged with runtime options from context creation.

Sorry for the wall of text, but for this to be flexible enough, I think we would need something like above. Assuming we get it right, I think it would be a lot better than manager_start.sh
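A sketch of the namespace proposed above in HOCON (spark.manager and its keys follow this comment's proposal and are not an existing SJS config section; the values are purely illustrative):

```
spark.manager {
  # launcher-facing master; would fall back to spark.master when unset
  master = "mesos://dispatcher-host:7077"

  # k/v pairs passed to the launcher as --conf options, mergeable with
  # runtime options supplied at context creation
  conf {
    "spark.mesos.driver.constraints" = "rack:LIKE:us-east.*"
    # optional environment interpolation, e.g. for MESOS_SANDBOX
    "spark.mesos.uris" = ${?MESOS_SANDBOX}
  }
}
```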

@bsikander
Contributor Author

@derSascha thank you for the comments.

1- Should a new file named settings.sh be added with all the specific settings? Common settings would go to setenv.sh and all others to settings.sh?
2- OK, I will add this
3- OK, I will add this
4- OK I will add these settings driver-memory and driver-cores
5- Regarding spark.submit.args.conf = ["foo=bar", "c=d",...]: in the current PR you can set the key/value pairs in setenv.sh using MANAGER_EXTRA_SPARK_CONFS. Users can specify key/value pairs like

MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d"

In the Launcher, I am already parsing this and adding each key/value pair as a --conf argument.

6- Currently the YARN killed event is handled through a SparkListener. We could also implement it via SparkAppHandle, but I don't see any benefit to using SparkAppHandle.
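The MANAGER_EXTRA_SPARK_CONFS parsing described above (done in Scala inside the Launcher) can be sketched in shell terms like this; the variable format is from the comment, the loop itself is illustrative:

```shell
# Turn MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d" into repeated --conf
# arguments, mirroring what the Launcher does for the submit command.
MANAGER_EXTRA_SPARK_CONFS="foo=bar|c=d"

SUBMIT_ARGS=""
old_ifs=$IFS
IFS='|'                                  # split on the '|' separator
for kv in $MANAGER_EXTRA_SPARK_CONFS; do
  SUBMIT_ARGS="$SUBMIT_ARGS --conf $kv"
done
IFS=$old_ifs

echo "spark-submit$SUBMIT_ARGS ..."      # prints: spark-submit --conf foo=bar --conf c=d ...
```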

@derSascha
Contributor

@bsikander

  1. I mean the config/<environment>.sh that gets renamed to settings.sh while packaging and is then sourced here
  2. It would be nice to set these options on a per-context basis, including environment variables that are evaluated at runtime (see the post from @addisonj)
  3. Oops, my fault. Ignore this point.

@addisonj
The --files thing sounds like Spark standalone cluster mode. In the current version, set REMOTE_JOBSERVER_DIR to some base URL in your settings.sh, and manager_start.sh should prefix all files with this URL.

About the config part: I am not a pro in Scala config parsing and evaluation, but this sounds like a very important feature when removing manager_start.sh.

@velvia
Contributor

velvia commented Dec 17, 2017

@addisonj @derSascha so let me summarize what you are saying to make sure I understand...

The Spark arguments, even ones such as spark.master, that the Launcher has to use (or the equivalent spark-submit params) may be different from what the launched process should use (what is actually passed by the SJS ContextManager to each context process as a config); thus we need a separate configuration aside from the simple spark.master stuff in the context config.

Is that right?

I kind of like spark.launcher better than spark.manager, because when I see "manager" I think of the JobManager, which is our (SJS's) wrapper for starting a job in context-per-jvm.

@bsikander
Contributor Author

@velvia @derSascha @addisonj
Let's finalize. I agree with @velvia that spark.launcher is better. Users can set any Spark configuration, like spark.proxy.user, spark.driver.cores, or spark.driver.memory, statically in the config file or during POST /contexts. We combine both static and dynamic configs and use a loop to add all the configs (similar to the passthrough section).

As a side note:
We could even do this change without a spark.launcher section. We could just use the spark section, put all the configs directly in it, and read the static config like

val test = testConfig.getConfig("spark")
                      .withoutPath("context-settings")
                      .withoutPath("contexts")
                      .withoutPath("jobserver")

Let's finish this because other changes are being blocked by this change.

bin/setenv.sh Outdated
MANAGER_EXTRA_SPARK_CONFS=
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=file:$appdir/log4j-server.properties"
SPARK_LAUNCHER_VERBOSE=0
if [ -z "$MANAGER_JAR_FILE" ]; then
Contributor

@velvia velvia Dec 21, 2017

I believe you can do ${MANAGER_JAR_FILE:="$appdir/spark-job-server.jar"}
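The `:=` parameter expansion suggested here assigns a default only when the variable is unset or empty; a minimal sketch (paths illustrative):

```shell
appdir="/opt/sjs"

unset MANAGER_JAR_FILE
# ':=' assigns the default when the variable is unset or empty;
# the leading ':' is a no-op command that just forces the expansion
: "${MANAGER_JAR_FILE:="$appdir/spark-job-server.jar"}"
echo "$MANAGER_JAR_FILE"    # prints: /opt/sjs/spark-job-server.jar

# an already-set value is left untouched
MANAGER_JAR_FILE="/custom/path.jar"
: "${MANAGER_JAR_FILE:="$appdir/spark-job-server.jar"}"
echo "$MANAGER_JAR_FILE"    # prints: /custom/path.jar
```

This would replace the explicit `if [ -z "$MANAGER_JAR_FILE" ]` check with a one-liner.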

### Modifying the `<environment>.sh` script
Replace `MANAGER_*` variables with
```
MANAGER_JAR_FILE="$appdir/spark-job-server.jar"
Contributor

I wonder if it might be good to put these alternatives in setenv.sh as well, after all the Mesos one is there but YARN isn't.

Contributor Author

Good point, but at this point I don't have any variable which tells me the current spark.submit.deployMode. I do have a path to the *.conf file, and somehow within bash I could try to parse it. What do you say?
Also @derSascha any comments?
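One rough way to sniff the deploy mode out of the conf file from bash might be a simple text extraction (a sketch only; a HOCON file with nested blocks or comments would need more careful parsing):

```shell
# Extract spark.submit.deployMode from a flat, properties-style conf file.
# Nested HOCON blocks would defeat this simple grep-based approach.
CONF_FILE=/tmp/demo.conf
cat > "$CONF_FILE" <<'EOF'
spark.master = "spark://host:7077"
spark.submit.deployMode = "cluster"
EOF

DEPLOY_MODE=$(grep 'spark.submit.deployMode' "$CONF_FILE" \
  | cut -d= -f2 | tr -d ' "')
echo "$DEPLOY_MODE"    # prints: cluster
```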

@velvia
Contributor

velvia commented Dec 21, 2017

@bsikander the changes look good to me.

There is currently the contextconfig section for defaults. Why can't we just pull the Spark settings from there for the Launcher? Or is it because we cannot tell which ones need to be preset before calling the launcher? Rather than creating a whole new config section, I'd rather create a "launcher" section for launcher-specific configs. For example

spark.context-settings.launcher.spark.driver.cores = ...

(if that makes sense)

@bsikander
Contributor Author

@velvia
Ok, as far as I know we can add the launcher configs in spark.context-settings. One thing which is not clear: should I keep the passthrough section, which is parsed like this?

Configs defined in the passthrough section are added to sparkConf when the JobManagerActor is already initialized (i.e. the driver JVM has already started); all of these settings could also be added as --conf during spark-submit/SparkLauncher. Should I just remove the passthrough section and add a launcher section which contains all the static configs and can be overridden by POST /contexts?

@velvia
Contributor

velvia commented Dec 21, 2017 via email

@bsikander
Contributor Author

bsikander commented Jan 24, 2018

Sorry for being away for some time; I was on holiday.

So, I have added a new section in the config named spark.context-settings.launcher; users can add Spark properties in this section, and they will be added to spark-submit by SparkLauncher.
e.g.

curl -X POST "localhost:8090/contexts/streaming?context-factory=spark.jobserver.context.StreamingContextFactory&launcher.spark.driver.cores=1&launcher.spark.driver.memory=512m&launcher.spark.master=spark://<IP>:7077"

Note: The JOBSERVER_MEMORY environment variable is no longer used by SparkLauncher to set spark.driver.memory. Instead, spark.driver.memory is picked up from either application.conf or <environment>.conf.

A few things that we need to take care of in the future regarding spark.context-settings. Users can add Spark configurations in the following areas:
1- Directly under spark.context-settings (e.g. spark.context-settings.spark.driver.memory)
2- Under spark.context-settings.passthrough
3- Under spark.context-settings.hadoop
4- Under spark, e.g. spark.master

All the above settings are added to SparkConf like this.

Now, we have a new section for Launcher and properties can be set under spark.context-settings.launcher.

Since sparkConf takes precedence over spark-submit arguments, configurations set in launcher can be overridden at runtime, which can confuse users.
For compatibility purposes we can keep them for now, but in the future we should only have 2 sections for clarity.
1- spark.context-settings.launcher
2- spark.context-settings.conf
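A sketch of that two-section layout in HOCON (section names as proposed above; the concrete settings are illustrative examples only):

```
spark.context-settings {
  # read by SparkLauncher before the driver JVM starts
  launcher {
    spark.driver.memory = 512m
    spark.driver.cores = 1
  }
  # applied to SparkConf after the driver JVM has started
  conf {
    spark.serializer = "org.apache.spark.serializer.KryoSerializer"
  }
}
```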

@bsikander
Contributor Author

@velvia @derSascha @addisonj can you have a look at my new commit?

@bsikander
Contributor Author

Can someone please have a look at this change?

@bsikander
Contributor Author

@velvia I have some other changes that I want to push on top of this one. It would be nice if you could have a look.

@velvia
Contributor

velvia commented Feb 7, 2018

@bsikander the last commit looks fine, sorry I have been busy with work and away on holiday also. Going to merge it.

@velvia velvia merged commit cf5f1a2 into spark-jobserver:master Feb 7, 2018
@bsikander
Contributor Author

Thank you for merging :)

@thoHeinze thoHeinze mentioned this pull request Apr 6, 2018
@permanentstar

permanentstar commented May 14, 2018

I encountered a context start failure when using this feature in YARN cluster mode.
The Spark driver (on a different node than the jobserver) threw an exception saying it "Could not find system configuration file".
Comparing the v0.8.0 and master builds, the master build is missing the conf file and the log4j file.
I didn't change any related env settings in settings.sh.
hdfs dfs -ls /user/hadoop/.sparkStaging/application_1524042746805_0147
Found 4 items
-rw-r--r-- 3 hadoop supergroup 32649 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/spark_conf.zip
-rw-r--r-- 3 hadoop supergroup 4060 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/cluster.conf
-rw-r--r-- 3 hadoop supergroup 620 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/log4j-cluster.properties

-rw-r--r-- 3 hadoop supergroup 40539377 2018-05-14 17:40 /user/hadoop/.sparkStaging/application_1524042746805_0147/spark-job-server.jar
hdfs dfs -ls /user/hadoop/.sparkStaging/application_1524042746805_0134
Found 2 items
-rw-r--r-- 3 hadoop supergroup 32260 2018-05-14 16:37 /user/hadoop/.sparkStaging/application_1524042746805_0134/spark_conf.zip
-rw-r--r-- 3 hadoop supergroup 40991132 2018-05-14 16:37 /user/hadoop/.sparkStaging/application_1524042746805_0134/spark-job-server.jar

Do I need some more configuration?

Update:

I finally modified the code in spark.jobserver.util.ManagerLauncher and the env variables to make it work in YARN cluster mode. Can anyone check whether this has side effects? Thank you.

// Stage the jobserver conf file (and the log4j properties, if configured)
// to the YARN containers via --files
var addFiles: Seq[String] = Seq(getEnvironmentVariable("MANAGER_CONF_FILE"))

val logConf = loggingOpts.split("""\s+""").map(_.split("="))
  .find(arr => arr(0) == "-Dlog4j.configuration")

if (logConf.isDefined) {
  addFiles = addFiles :+ logConf.get(1)
}

launcher.addSparkArg("--files", addFiles.mkString(","))
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=log4j-cluster.properties"
JOBSERVER_CONFIG="cluster.conf"
