Contribute spark-tensorflow-distributor to the ecosystem #154
Conversation
cc @yuefengz can you help review this PR?
@yuefengz ping :)
haven't finished a full pass yet
RUN apt-get install -y python${PYTHON_INSTALL_VERSION} python${PYTHON_INSTALL_VERSION}-dev python${PYTHON_INSTALL_VERSION}-distutils && \
    apt-get clean && \
    wget https://bootstrap.pypa.io/get-pip.py && \
    python$PYTHON_INSTALL_VERSION get-pip.py && \
If we need different python versions, we should consider miniconda.
conda create -n my_env python=3.6
conda env update -n my_env -f environment.yml # do not specify python version in env.yml
""" | ||
if gpu_resource_name in resources: | ||
addresses = resources[gpu_resource_name].addresses | ||
pattern = re.compile('^[1-9][0-9]*|0$') |
Is it simpler to try `int(..)` and let it raise ValueError?
`int` will allow zero padding, which I'd prefer to disallow.
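To make the tradeoff concrete, here's a small sketch (a hypothetical standalone helper, not the PR's actual code) of a fully anchored variant of the pattern that rejects zero-padded addresses, which a bare `int()` would silently accept:

```python
import re

# Hypothetical helper, not the PR's code: validate a GPU address string.
# The pattern accepts '0' or any number without a leading zero. The
# alternation is grouped so that fullmatch anchors both branches; a bare
# '^[1-9][0-9]*|0$' anchors only one side of each alternative.
_GPU_ADDRESS = re.compile(r'(?:[1-9][0-9]*|0)')

def parse_gpu_address(address):
    """Parse a GPU address, rejecting zero-padded forms like '007'."""
    if _GPU_ADDRESS.fullmatch(address) is None:
        raise ValueError(f'invalid GPU address: {address!r}')
    return int(address)
```

Here `parse_gpu_address('12')` returns 12 and `parse_gpu_address('007')` raises, whereas `int('007')` quietly returns 7, which is the zero-padding concern above.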
@@ -0,0 +1,331 @@
import logging
Is it possible for you to follow the Google Python style guide? http://google.github.io/styleguide/pyguide.html
Generally speaking, that guide recommends using pylint, breaking lines at 80 chars, using a different form of docstrings, etc.
Used the google yapf formatter to format to google style so hopefully most of the major formatting issues are out of the way. Added yapf + pylint check to ci too.
Some suggestions but up to you:
would be nicer to break lines at column 80: http://google.github.io/styleguide/pyguide.html#32-line-length
docstring of function arguments should follow this format: http://google.github.io/styleguide/pyguide.html#doc-function-args
.. note:: See more at https://www.tensorflow.org/guide/distributed_training
"""
def __init__(self, num_slots, gpu_resource_name='gpu', use_custom_strategy=False):
Suggestion: would breaking it into two or more arguments make it clearer? Such as `num_gpus` and `num_workers`, or `num_replicas` and `local`. In tf.distribute, we use the term "replica" quite often.
We preferred to stick to a single scaling parameter to keep the simplicity of swapping between CPU and GPU workflows. A few other names for the param: `scale`, `n`. Agree with adding a `local=False` or `local_mode=False` parameter, will add that.
.. note:: See more at https://www.tensorflow.org/guide/distributed_training
"""
def __init__(self, num_slots, gpu_resource_name='gpu', use_custom_strategy=False):
Suggestion: would changing it to "use_gpu=True" make it clearer?
We would have to split it into two parameters because we can't assume the GPU resource name in the Spark conf is 'gpu', but I think that makes sense for the sake of clarity 👍
one for the user and wraps the training function in the strategy context, allowing the user to provide
non-distributed TensorFlow code that is executed as distributed code.

Example with use_custom_strategy=True:
Can we tell when users do it the wrong way (use_custom_strategy=False but create a strategy themselves) and throw an exception?
Nested tensorflow strategy scopes raise an exception by default - are you suggesting catching and re-raising a more informative exception?
ah I think that is good enough.
the Spark configuration. Note also that for GPU training, num_slots will limit the number of GPUs used
for training even if more are available, so that exactly num_slots GPUs are used in total. Spark does not
restrict CPU cores for tasks, so for CPU training, num_slots rarely needs to be greater than the
number of workers; for local mode, set num_slots=-1.
num_slots rarely needs to be greater than the number of workers

Does that mean in some cases, the number of available workers can be less than `num_slots`?
Yes, since `num_slots` represents the number of Spark task slots in CPU training, a user might want to, say, run two TensorFlow training workers on each Spark worker.
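A toy calculation (with assumed cluster numbers, not taken from the PR) of why CPU task slots can outnumber machines: each Spark worker contributes roughly `cores // spark.task.cpus` slots, so `num_slots` can validly exceed the worker count.

```python
# Illustrative sketch with assumed numbers: task-slot capacity in CPU mode.
def total_task_slots(num_spark_workers, cores_per_worker, task_cpus=1):
    """Task slots the cluster offers: each worker gives cores // task_cpus."""
    return num_spark_workers * (cores_per_worker // task_cpus)

# A 2-worker cluster with 8 cores each and spark.task.cpus=4 offers 4 slots,
# so num_slots=4 would place two TensorFlow workers on each Spark worker.
```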
import tensorflow as tf
# training code
"""
self.logger = _get_logger(self.__class__.__name__)
For private fields, we usually hide them by prepending a "_" in their names.
    'please contact your cluster administrator.'
    f'The conf `{key}` was not found in the Spark configuration.'
)
task_gpu_amount = int(self.sc.getConf().get(key))
Is it always True that in a spark cluster all workers have the same number of GPUs?
No, but Spark 3.0's resource-aware scheduling (used here) guarantees that each Spark task is allocated the same number of GPUs. This value is specified by the cluster admin in the Spark conf `spark.task.resource.gpu.amount`, and the task program can find which GPUs it is allocated by Spark with the `BarrierTaskContext.resources()` method.
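As a rough sketch of the conf lookup described above (using a plain dict standing in for the real `SparkConf`, and a hypothetical helper name):

```python
def get_task_gpu_amount(conf, key='spark.task.resource.gpu.amount'):
    """Read the per-task GPU amount from a conf mapping.

    `conf` is a plain dict used here in place of SparkConf; the runner
    itself calls self.sc.getConf().get(key).
    """
    value = conf.get(key)
    if value is None:
        raise ValueError(
            f'The conf `{key}` was not found in the Spark configuration.')
    return int(value)
```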
def set_gpus(context):
    gpus_owned = get_gpus_owned(context.resources(), gpu_resource_name)
    my_num_gpus = (num_slots // num_tasks) + (context.partitionId() < (num_slots % num_tasks))
    gpu_addresses = [str(e) for e in random.sample(gpus_owned, my_num_gpus)]
Can I assume that all GPUs on a machine will be taken except for the last machine? The reason I ask is randomly chosen GPUs may not have NVLinks in between.
Ah good point, we can't. For example, if we do GPU training with 6 GPUs (`num_slots=6`) on a 2-worker cluster, each with 4 GPUs, then the training will be done with 3 GPUs on each worker, rather than 4 GPUs on one worker and 2 on the other. Currently Spark's resource scheduling doesn't support resource groups, so I'd suggest leaving this as a TODO until we can use that Spark feature to report the NVLink groups to Spark.
@jhseu What is your recommended CI solution? This PR uses GitHub workflows. Is that okay?
return os.environ['CUDA_VISIBLE_DEVICES']

with pytest.raises(Exception):
    MirroredStrategyRunner(num_slots=2, gpu_resource_name='gpu').run(train_fn)
Should also verify users can set a conf to ignore it.
k, v = l.split(None, 1)
conf[k] = v

with open('tests/integration/spark_conf/spark-defaults.conf', 'w') as f:
minor: the names "base", "custom", "defaults" are a little confusing
For CI, GitHub workflows are fine for now, I think. We haven't set up an alternative for this repo.
The PR looks good to me overall though I have some small suggestions and questions. Thank you!
    'please contact your cluster administrator.'
    f'The conf `{key}` was not found in the Spark configuration.'
)
task_gpu_amount = int(self.sc.getConf().get(key))
Maybe you can call `_get_gpus_owned` here? Looks like there are two different ways to get the number of GPUs. Are they different? If not, could you consolidate them?
@staticmethod
def _get_gpus_owned(resources, gpu_resource_name):
    """
    Gets the number of GPUs that Spark scheduled to the calling task
Looks like this method is returning "the number of GPUs"?
@@ -0,0 +1,53 @@
import argparse
Curious what this file does? Would you mind adding some description in the beginning of this file?
Made a release here: https://pypi.org/project/spark-tensorflow-distributor/0.0.3/
@guptapriya @yuefengz @jhseu also please let me know who from TensorFlow I should add as an owner on the PyPI project :)
This PR aims to act as both a proposal and an initial version for a contribution of the spark-tensorflow-distributor Python package. As mentioned in #151, the general mandate of this package is to make it easier for users to do distributed training with TensorFlow 2 on their Spark clusters. Currently, the package primarily acts as a job launcher for starting TensorFlow servers, configuring GPU and CPU resources for the user based on Spark resource scheduling so that they can easily run their deep learning workloads.
This PR also includes CI with GitHub workflows, which acts at the repository level by default. However, the CI is set up so that the checks are only triggered by changes to this package's subdirectory in the ecosystem. This behavior is described in .github/workflows/spark-tensorflow-distributor.yml.

I'd also like to publish this package to PyPI and am wondering if there's an ecosystem-specific process for that.
Welcoming any and all feedback on this PR :)
cc @guptapriya @mengxr @husseinnagr-db