
Conversation

@winston-zillow

The current behavior is to start the TensorFlow PS nodes on a Spark executor, which wastes the GPUs available on that node. These changes allow one to start the PS nodes on the driver while the workers are started on the Spark executors.
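A minimal sketch of the placement this change enables. Note this is illustrative only: the `assign_cluster` helper, host names, and ports are made up for this example and are not TensorFlowOnSpark internals. The point is that PS tasks land on the driver host while every executor becomes a worker.

```python
# Illustrative sketch only -- not the actual TensorFlowOnSpark implementation.
# With PS nodes on the driver, worker tasks map one-to-one onto Spark
# executors, so the executors' GPUs are used exclusively by workers.

def assign_cluster(driver_host, executor_hosts, num_ps):
    """Return a TF-style cluster spec mapping job names to host:port lists."""
    # All PS tasks live on the driver (one port per task; ports are arbitrary here)
    ps = ["%s:%d" % (driver_host, 2222 + i) for i in range(num_ps)]
    # Every executor becomes a worker, so num_workers == num_executors
    workers = ["%s:2222" % h for h in executor_hosts]
    return {"ps": ps, "worker": workers}

spec = assign_cluster("driver-node", ["exec-1", "exec-2"], num_ps=1)
print(spec["ps"])      # ['driver-node:2222'] -- PS on the driver
print(spec["worker"])  # ['exec-1:2222', 'exec-2:2222'] -- workers on executors
```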

@anfeng
Contributor

anfeng commented Dec 15, 2017

@winston-zillow please address the merge conflict

@leewyang
Contributor

@winston-zillow Finally had time to take a more detailed look at this in my environments (Spark Standalone, Hadoop/YARN).

In my setup, I saw the following:

  1. It looks like the Spark job isn't stopping cleanly (I'm assuming due to the PS thread).
  2. I had to explicitly set --cluster_size equal to --num-executors plus one for the PS node.
  3. I had to run a TensorFlow/CPU build on the driver, while running a TensorFlow/GPU build on the executors. Not sure if this is well supported by TF, but it worked.

Have you seen similar issues in your env? If not, can you describe your setup?

@winston-zillow
Author

@leewyang my Spark job was able to complete, and the TensorFlowOnSpark nodes joined successfully. I was on Hadoop/YARN in an EMR environment and used Python code to start the Spark jobs. I haven't tried spark-submit; is that what you use? I'll try again to see if I hit any issues.

@leewyang
Contributor

Yes, we use a dedicated Hadoop/YARN cluster with spark-submit.

@winston-zillow
Author

@leewyang I fixed the problem of the driver node not terminating. Also, this seems to work only in TENSORFLOW mode, so I put in a check.

I ran it successfully in a YARN/EMR environment:

export SPARK_HOME=/usr/lib/spark
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export PYTHONFAULTHANDLER=true

# note: --cluster_size is num_executors + 1 (the extra slot is the PS node on the driver)
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--executor-memory 8G \
--executor-cores 1 \
--py-files tensorflowonspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=16G \
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/jvm/java/jre/lib/amd64/server:/usr/lib/jvm/java/jre/lib/amd64:/tmp/libhdfs \
--conf spark.executorEnv.HADOOP_HDFS_HOME=$HADOOP_HDFS_HOME \
TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--steps 10 \
--images /user/hadoop/mnist_data/train/images \
--labels /user/hadoop/mnist_data/train/labels \
--format csv \
--mode train \
--model mnist_model \
--driver_ps_nodes True \
--cluster_size 3

Console log at driver:

2018-01-19 00:13:05,432 INFO (MainThread-26109) Shutting down cluster
2018-01-19 00:13:05,633 INFO (Thread-3-26109) Got msg: None
2018-01-19 00:13:05,633 INFO (Thread-3-26109) Terminating PS
2018-01-19T00:13:10.837377 ===== Stop
18/01/19 00:13:10 INFO SparkContext: Invoking stop() from shutdown hook
18/01/19 00:13:10 INFO SparkUI: Stopped Spark web UI at http://172.30.0.203:4040
18/01/19 00:13:11 INFO YarnClientSchedulerBackend: Interrupting monitor thread
18/01/19 00:13:11 INFO YarnClientSchedulerBackend: Shutting down all executors
18/01/19 00:13:11 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/01/19 00:13:11 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
18/01/19 00:13:11 INFO YarnClientSchedulerBackend: Stopped
18/01/19 00:13:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/19 00:13:11 INFO MemoryStore: MemoryStore cleared
18/01/19 00:13:11 INFO BlockManager: BlockManager stopped
18/01/19 00:13:11 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/19 00:13:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/19 00:13:11 INFO SparkContext: Successfully stopped SparkContext
18/01/19 00:13:11 INFO ShutdownHookManager: Shutdown hook called
18/01/19 00:13:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-0f01312e-b879-4714-870a-6a9a06279ecd/pyspark-e61b56a4-9c2f-4080-9cd0-7bbf8fdf3a93
18/01/19 00:13:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-0f01312e-b879-4714-870a-6a9a06279ecd
-bash-4.2$ 

Contributor

@leewyang left a comment


This looks good. I was able to run successfully in my environment, however I have one comment for something that tripped me up during testing.

:tensorboard: boolean indicating if the chief worker should spawn a Tensorboard server.
:input_mode: TFCluster.InputMode
:log_dir: directory to save tensorboard event logs. If None, defaults to a fixed path on local filesystem.
:driver_ps_nodes: run the PS nodes on the driver locally instead of on the Spark executors; this helps maximize computing resources (esp. GPUs) on the executors.
Contributor


So previously, we had: num_workers + num_ps = cluster_size, where cluster_size == num_executors.
With the --driver_ps_nodes option, this is now a bit different, since num_workers == num_executors.

Can you add a note like: "you will need to set cluster_size = num_executors + num_ps"?
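A tiny sanity check of the arithmetic, using the numbers from the spark-submit example above (2 executors, 1 PS node on the driver):

```python
# With --driver_ps_nodes, every executor is a worker, so the PS tasks
# must be counted on top of the executor count:
num_executors = 2   # --num-executors 2
num_ps = 1          # one PS node running on the driver
cluster_size = num_executors + num_ps
print(cluster_size)  # -> 3, matching --cluster_size 3 above
```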

Author


OK, will do

@leewyang
Contributor

Looks good. Thank you for your contribution.

@leewyang leewyang merged commit ce5e789 into yahoo:master Jan 25, 2018
leewyang added a commit that referenced this pull request Feb 6, 2018
eordentlich added a commit that referenced this pull request Feb 8, 2018
