Allow running PS nodes on the spark driver #183
Conversation
@winston-zillow please address the conflict
@winston-zillow Finally had time to take a more detailed look at this in my environments (Spark Standalone, Hadoop/YARN). In my setup, I saw the following:
Have you seen similar issues in your env? If not, can you describe your setup?
@leewyang my Spark job was able to complete and TFonSpark joined successfully. I was on Hadoop/YARN in an EMR environment and used Python code to start the Spark jobs. I haven't tried spark-submit; is that what you use? I will try again to see if I hit any issue.
Yes, we use a dedicated Hadoop/YARN cluster with spark-submit. |
# Conflicts:
#   examples/mnist/spark/mnist_spark.py
#   tensorflowonspark/TFCluster.py
#   tensorflowonspark/TFSparkNode.py
#   tensorflowonspark/pipeline.py
@leewyang I fixed the problem of the driver node not terminating. Also note that this seems to work only in the environments I've tested so far: I ran it successfully in a YARN/EMR environment. Console log at driver:
leewyang
left a comment
This looks good. I was able to run it successfully in my environment; however, I have one comment about something that tripped me up during testing.
tensorflowonspark/TFCluster.py
Outdated
:tensorboard: boolean indicating if the chief worker should spawn a Tensorboard server.
:input_mode: TFCluster.InputMode
:log_dir: directory to save tensorboard event logs. If None, defaults to a fixed path on the local filesystem.
:driver_ps_nodes: run the PS nodes on the driver locally instead of on the Spark executors; this helps maximize computing resources (esp. GPU).
So previously, we had: num_workers + num_ps = cluster_size, where cluster_size == num_executors.
With the --driver_ps_nodes option, this is now a bit different, since num_workers == num_executors.
Can you add a note along the lines of: "you will need to set cluster_size = num_executors + num_ps"?
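The sizing rule above can be sketched as a small helper. This is an illustration only; `required_cluster_size` is a hypothetical name, not part of the TensorFlowOnSpark API:

```python
def required_cluster_size(num_executors, num_ps, driver_ps_nodes):
    """Compute the cluster_size needed for a given setup.

    Without driver_ps_nodes: PS tasks occupy executors, so
    num_workers + num_ps == cluster_size == num_executors.
    With driver_ps_nodes: every executor is a worker and the PS tasks
    run on the driver, so cluster_size == num_executors + num_ps.
    """
    if driver_ps_nodes:
        return num_executors + num_ps
    return num_executors

# e.g. 4 executors and 1 PS task:
print(required_cluster_size(4, 1, False))  # -> 4 (1 PS + 3 workers, all on executors)
print(required_cluster_size(4, 1, True))   # -> 5 (4 workers on executors, 1 PS on the driver)
```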
OK, will do
Looks good. Thank you for your contribution.
The current behavior is to start the TensorFlow PS nodes on a Spark executor, which wastes the GPUs available on that node. These changes allow one to start the PS nodes on the driver while the workers are started on the Spark executors.
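The role placement described above can be illustrated with a hedged sketch. The `assign_roles` helper and the host names are hypothetical, invented for illustration; they are not TensorFlowOnSpark code:

```python
def assign_roles(driver_host, executor_hosts, num_ps, driver_ps_nodes):
    """Map TensorFlow job roles ("ps", "worker") to hosts.

    With driver_ps_nodes=True, the PS tasks are placed on the Spark
    driver, leaving every executor (and its GPUs) free for worker tasks.
    With driver_ps_nodes=False, the first num_ps executors are consumed
    by PS tasks and their GPUs go unused.
    """
    if driver_ps_nodes:
        ps = [driver_host] * num_ps
        workers = list(executor_hosts)
    else:
        ps = list(executor_hosts[:num_ps])
        workers = list(executor_hosts[num_ps:])
    return {"ps": ps, "worker": workers}

cluster = assign_roles("driver", ["exec1", "exec2", "exec3"],
                       num_ps=1, driver_ps_nodes=True)
print(cluster)  # {'ps': ['driver'], 'worker': ['exec1', 'exec2', 'exec3']}
```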