
Conversation

@winston-zillow

The current behavior is to start the TensorFlow PS nodes on a Spark executor, which wastes the GPUs available on that node. These changes allow one to start the PS nodes on the driver while the workers are started on the Spark executors.
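A minimal sketch of the placement this change enables. Note this is illustrative only: the `assign_cluster` helper, host names, and ports are made up for this example and are not TensorFlowOnSpark internals. The point is that PS tasks land on the driver host while every executor becomes a worker.

```python
# Illustrative sketch only -- not the actual TensorFlowOnSpark implementation.
# With PS nodes on the driver, worker tasks map one-to-one onto Spark
# executors, so the executors' GPUs are used exclusively by workers.

def assign_cluster(driver_host, executor_hosts, num_ps):
    """Return a TF-style cluster spec mapping job names to host:port lists."""
    # All PS tasks live on the driver (one port per task; ports are arbitrary here)
    ps = ["%s:%d" % (driver_host, 2222 + i) for i in range(num_ps)]
    # Every executor becomes a worker, so num_workers == num_executors
    workers = ["%s:2222" % h for h in executor_hosts]
    return {"ps": ps, "worker": workers}

spec = assign_cluster("driver-node", ["exec-1", "exec-2"], num_ps=1)
print(spec["ps"])      # ['driver-node:2222'] -- PS on the driver
print(spec["worker"])  # ['exec-1:2222', 'exec-2:2222'] -- workers on executors
```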

@anfeng
Contributor

anfeng commented Dec 15, 2017

@winston-zillow please address the merge conflict

@leewyang
Contributor

@winston-zillow Finally had time to take a more detailed look at this in my environments (Spark Standalone, Hadoop/YARN).

In my setup, I saw the following:

  1. It looks like the Spark job isn't stopping cleanly (I'm assuming due to the PS thread).
  2. I had to explicitly set --cluster_size equal to --num-executors plus one for the PS node.
  3. I had to run a TensorFlow/CPU build on the driver, while running a TensorFlow/GPU build on the executors. Not sure if this is well supported by TF, but it worked.

Have you seen similar issues in your env? If not, can you describe your setup?

@winston-zillow
Author

@leewyang my Spark job was able to complete, and the TensorFlowOnSpark nodes joined successfully. I was on Hadoop/YARN in an EMR environment and used Python code to start the Spark jobs. I haven't tried spark-submit; is that what you use? I'll try again to see if I hit any issues.

@leewyang
Contributor

Yes, we use a dedicated Hadoop/YARN cluster with spark-submit.

@winston-zillow
Author

@leewyang I fixed the problem of the driver node not terminating. Also, this seems to work only in TENSORFLOW mode, so I put in a check.

I ran it successfully in a YARN/EMR environment:

export SPARK_HOME=/usr/lib/spark
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export PYTHONFAULTHANDLER=true

# note: --cluster_size is num_executors + 1 (the extra slot is the PS node on the driver)
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--executor-memory 8G \
--executor-cores 1 \
--py-files tensorflowonspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=16G \
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/jvm/java/jre/lib/amd64/server:/usr/lib/jvm/java/jre/lib/amd64:/tmp/libhdfs \
--conf spark.executorEnv.HADOOP_HDFS_HOME=$HADOOP_HDFS_HOME \
TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--steps 10 \
--images /user/hadoop/mnist_data/train/images \
--labels /user/hadoop/mnist_data/train/labels \
--format csv \
--mode train \
--model mnist_model \
--driver_ps_nodes True \
--cluster_size 3

Console log at driver:

2018-01-19 00:13:05,432 INFO (MainThread-26109) Shutting down cluster
2018-01-19 00:13:05,633 INFO (Thread-3-26109) Got msg: None
2018-01-19 00:13:05,633 INFO (Thread-3-26109) Terminating PS
2018-01-19T00:13:10.837377 ===== Stop
18/01/19 00:13:10 INFO SparkContext: Invoking stop() from shutdown hook
18/01/19 00:13:10 INFO SparkUI: Stopped Spark web UI at http://172.30.0.203:4040
18/01/19 00:13:11 INFO YarnClientSchedulerBackend: Interrupting monitor thread
18/01/19 00:13:11 INFO YarnClientSchedulerBackend: Shutting down all executors
18/01/19 00:13:11 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/01/19 00:13:11 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
18/01/19 00:13:11 INFO YarnClientSchedulerBackend: Stopped
18/01/19 00:13:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/19 00:13:11 INFO MemoryStore: MemoryStore cleared
18/01/19 00:13:11 INFO BlockManager: BlockManager stopped
18/01/19 00:13:11 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/19 00:13:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/19 00:13:11 INFO SparkContext: Successfully stopped SparkContext
18/01/19 00:13:11 INFO ShutdownHookManager: Shutdown hook called
18/01/19 00:13:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-0f01312e-b879-4714-870a-6a9a06279ecd/pyspark-e61b56a4-9c2f-4080-9cd0-7bbf8fdf3a93
18/01/19 00:13:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-0f01312e-b879-4714-870a-6a9a06279ecd
-bash-4.2$ 

Contributor

@leewyang left a comment


This looks good. I was able to run successfully in my environment, however I have one comment for something that tripped me up during testing.

:tensorboard: boolean indicating if the chief worker should spawn a Tensorboard server.
:input_mode: TFCluster.InputMode
:log_dir: directory to save tensorboard event logs. If None, defaults to a fixed path on local filesystem.
:driver_ps_nodes: run the PS nodes on the driver locally instead of on the Spark executors; this helps maximize computing resources (esp. GPUs) on the executors.
Contributor


So previously, we had: num_workers + num_ps = cluster_size, where cluster_size == num_executors.
With the --driver_ps_nodes option, this is now a bit different, since num_workers == num_executors.

Can you add a note like: "you will need to set cluster_size = num_executors + num_ps"?
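A tiny sanity check of the arithmetic, using the numbers from the spark-submit example above (2 executors, 1 PS node on the driver):

```python
# With --driver_ps_nodes, every executor is a worker, so the PS tasks
# must be counted on top of the executor count:
num_executors = 2   # --num-executors 2
num_ps = 1          # one PS node running on the driver
cluster_size = num_executors + num_ps
print(cluster_size)  # -> 3, matching --cluster_size 3 above
```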

Author


OK, will do

@leewyang
Contributor

Looks good. Thank you for your contribution.

@leewyang leewyang merged commit ce5e789 into yahoo:master Jan 25, 2018
leewyang added a commit that referenced this pull request Feb 6, 2018
eordentlich added a commit that referenced this pull request Feb 8, 2018
