
Conversation

@k-w-w (Contributor) commented Jun 5, 2018

Hi all, I added DistributionStrategy to the Transformer model. Currently, the model isn't running very well with MirroredStrategy and I'm not sure why. @robieta @guptapriya As people familiar with DistributionStrategy, please help!

Current stats:

GPUs     Global steps/sec     Batch size (per device)
1 GPU    1.11                 4096
4 GPU    0.34                 3072

I decreased the per-device batch size for the 4-GPU run because of OOM errors.

@k-w-w requested review from qlzh727, robieta and guptapriya June 5, 2018 21:35
@k-w-w requested review from karmel and a team as code owners June 5, 2018 21:36
@karmel (Contributor) left a comment:

I assume this works; have you validated that it does for CPU, 1-GPU, and multi-GPU?

  else:
    params["batch_size"] = distribution_utils.per_device_batch_size(
        flags_obj.batch_size or params["default_batch_size"],
        flags_core.get_num_gpus(flags_obj))
Contributor:

nit: cleaner:

params["batch_size"] = flags_obj.batch_size or params["default_batch_size_tpu"]
if not params["use_tpu"]:
  params["batch_size"] = distribution_utils.per_device_batch_size(
      params["batch_size"], flags_core.get_num_gpus(flags_obj))

# limitations under the License.
# ==============================================================================
"""Helper functions for running models in a distributed setting."""

Contributor:

Note to selves: Build files will be required for this.


  remainder = batch_size % num_gpus
  if remainder:
    err = ('When running with multiple GPUs, batch size '
Contributor:

nit: double quotes
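For context on this thread, here is a rough sketch of what the per_device_batch_size helper might look like, pieced together from the excerpt above; the exact error wording and the integer division at the end are assumptions, and the strings use double quotes per the nit.

def per_device_batch_size(batch_size, num_gpus):
  """Scales the global batch size down to a per-device batch size (sketch only)."""
  if num_gpus <= 1:
    return batch_size

  remainder = batch_size % num_gpus
  if remainder:
    err = ("When running with multiple GPUs, batch size must be a multiple "
           "of the number of available GPUs. Found {} GPUs with a batch size "
           "of {}; try --batch_size={} instead.".format(
               num_gpus, batch_size, batch_size - remainder))
    raise ValueError(err)
  return int(batch_size / num_gpus)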

@robieta (Contributor) commented Jun 6, 2018

FYI training blows up with synthetic data if the data_dir doesn't exist, even though it isn't used.

edit (kathy): can't reply, but this has been fixed. There was a typo that prevented the synthetic data flag from being seen.

@k-w-w (Contributor, Author) commented Jun 7, 2018

This should be ready for review. Thanks @guptapriya, @robieta, and @yuefengz for the DistributionStrategy help! We've determined that the embedding is slowing the model down; a feature request has been filed with Dist Strat.

With the hierarchical all_reduce setting, the speeds are more reasonable:

GPUs                          Global steps/sec     Batch size (per device)
1 GPU                         1.11                 4096
4 GPU                         0.57                 3072
4 GPU (on an older TF rev)    0.66                 3072

@qlzh727 (Member) left a comment:

Thanks for adding distribution_utils

      num_gpus=flags_core.get_num_gpus(flags_obj)
  )
  distribution_strategy = distribution_utils.get_distribution_strategy(
      flags_core.get_num_gpus(flags_obj), use_hierarchical_copy=True)
Contributor:

I think it would be better if the all_reduce algorithm were a performance flag, since the optimal one will vary by hardware. For ResNet in particular, DistStrat auto selection should be fine in a week or so.

dataset = tf.data.Dataset.from_tensors(tf.ones([batch, length], tf.int64))
dataset = dataset.map(lambda x: (x, x))
dataset = dataset.cache()
dataset = dataset.repeat(1000)
Contributor:

I would prefer that we don't hard-code a dummy size into the synthetic data. Because there is already support for setting a step-based schedule, infinite repeat should be fine. (This is what ResNet does.)
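A minimal sketch of that suggestion, reusing the shape and dtype from the excerpt above (batch and length are placeholder values here):

import tensorflow as tf

batch, length = 16, 64  # placeholders; the real values come from the model params

# Constant (inputs, targets) pairs repeated indefinitely; the step-based
# training schedule bounds the run, so no hard-coded repeat count is needed.
dataset = tf.data.Dataset.from_tensors(tf.ones([batch, length], tf.int64))
dataset = dataset.map(lambda x: (x, x))
dataset = dataset.cache()
dataset = dataset.repeat()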

Contributor:

See the similar message in #4476, but we should make a helper fn for getting synthetic data in a particular shape... let's not have both PRs do that, though.

"base": model_params.BASE_PARAMS,
"base_multi_gpu": model_params.BASE_MULTI_GPU_PARAMS,
"big": model_params.BIG_PARAMS,
"big_multi_gpu": model_params.BIG_MULTI_GPU_PARAMS,
Contributor:

I feel like it might be better to exclude these from the choices and hot-swap them in if there are multiple GPUs, logging a message so that people know it happened. Thoughts?

Contributor (Author):

Yeah, I was conflicted; swapping the parameters automatically vs. creating a separate set of params seemed about equal. Usability-wise, I prefer swapping the params in the multi-GPU case so the run commands don't have to change as much. I ended up choosing the other option to keep models "mathematically equivalent" when moving to multiple GPUs (like how the batch size is global instead of per-device).

I don't have strong preferences. Do you still think we should go with swapping the params?

Contributor:

I was thinking we'd have both sets of params, but not include the multi-GPU ones in the mapping dict. Then, if running on multiple GPUs, swap the entire param set up front.
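A small sketch of that idea (flag and variable names such as flags_obj.param_set are assumptions, and model_params, flags_obj, and num_gpus come from the surrounding module; this is not the code in this PR): keep only the single-GPU names in the user-facing choices and swap the whole param set up front when several GPUs are present.

import tensorflow as tf

PARAMS_MAP = {
    "base": model_params.BASE_PARAMS,
    "big": model_params.BIG_PARAMS,
}

params = PARAMS_MAP[flags_obj.param_set]
if num_gpus > 1:
  # Hypothetical hot-swap: upgrade to the multi-GPU variant and say so.
  params = {
      "base": model_params.BASE_MULTI_GPU_PARAMS,
      "big": model_params.BIG_MULTI_GPU_PARAMS,
  }[flags_obj.param_set]
  tf.logging.info("Multiple GPUs detected; using the %s_multi_gpu parameter "
                  "set." % flags_obj.param_set)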

"""Add flags and flag validators for running transformer_main."""
# Add common flags (data_dir, model_dir, train_epochs, etc.).
flags_core.define_base(multi_gpu=False, num_gpu=False)
flags_core.define_base(multi_gpu=False)
Contributor:

I don't think we need the multi_gpu=False here, and it is deceptive, so better to remove?

@k-w-w (Contributor, Author) Jun 7, 2018:

Perhaps we should remove the multi_gpu flag, and use num_gpu instead? MNIST uses multi_gpu but not num_gpu, and the resnet models use num_gpu but not multi_gpu.

I have a question/suggestion about the num_gpu flag (+@robieta on this): currently, the default value of num_gpu is 0 or 1 depending on whether a GPU is available. Can we change this default to None instead?
OneDeviceStrategy requires more memory than having no DistributionStrategy (batch size of 4096 causes OOM errors when using OneDeviceStrategy). When the user doesn't specify num_gpus, we should default to using no DistributionStrategy.
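A tiny sketch of the proposal being floated here (hypothetical flag handling only; further down the thread it is left as-is for this PR):

# Hypothetical: --num_gpus defaults to None rather than 0 or 1.
num_gpus = flags_obj.num_gpus
if num_gpus is None:
  # No DistributionStrategy at all: plain Estimator on the default device,
  # avoiding the extra memory overhead seen with OneDeviceStrategy.
  distribution_strategy = None
else:
  distribution_strategy = distribution_utils.get_distribution_strategy(num_gpus)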

Contributor:

Hm. Can you ping DistStrat and see if this is expected, and what the advantages of using OneDevice are/will be?

Contributor (Author):

Hmm, tried running it again, and it went through without failing. Seems like this might be Transformer getting unlucky with the dynamic batching? For now, I'll leave it as is.

What are your thoughts about the num_gpu vs multi_gpu flags?

Contributor:

We want to get everything on num_gpus and remove multi_gpu altogether.

dataset = tf.data.Dataset.from_tensors(tf.ones([batch, length], tf.int64))
dataset = dataset.map(lambda x: (x, x))
dataset = dataset.cache()
dataset = dataset.repeat(1000)
Contributor:

See the similar message in #4476, but we should make a helper fn for getting synthetic data in a particular shape... let's not have both PRs do that, though.

import tensorflow as tf


def get_distribution_strategy(num_gpus, use_hierarchical_copy=False):
Contributor:

Add Args and Returns sections? In particular, a description of hierarchical_copy would be good.
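For illustration, one way the documented helper could read. This is a sketch assuming the contrib-era tf.contrib.distribute API that was current at the time of this PR, not the code that was merged:

import tensorflow as tf


def get_distribution_strategy(num_gpus, use_hierarchical_copy=False):
  """Returns a DistributionStrategy for the requested number of GPUs (sketch).

  Args:
    num_gpus: Number of GPUs to run this model on.
    use_hierarchical_copy: If True, MirroredStrategy uses the
      hierarchical_copy all-reduce, which can be faster on machines whose
      GPUs are connected through a hierarchy of links (e.g. a DGX-1) rather
      than talking to NCCL directly.

  Returns:
    A OneDeviceStrategy pinned to the CPU when num_gpus is 0, a
    OneDeviceStrategy on the single GPU when num_gpus is 1, and a
    MirroredStrategy across all GPUs otherwise.
  """
  if num_gpus == 0:
    return tf.contrib.distribute.OneDeviceStrategy("device:CPU:0")
  if num_gpus == 1:
    return tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")
  if use_hierarchical_copy:
    return tf.contrib.distribute.MirroredStrategy(
        num_gpus=num_gpus,
        cross_tower_ops=tf.contrib.distribute.AllReduceCrossTowerOps(
            "hierarchical_copy", num_packs=num_gpus))
  return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)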

@k-w-w (Contributor, Author) commented Jun 7, 2018 via email

@k-w-w (Contributor, Author) commented Jun 11, 2018

@robieta @karmel PTAL: added the synthetic dataset helper and removed the multi_gpu flag.

@robieta (Contributor) left a comment:

Just a couple minor things, but overall LGTM.


    return loss_scale > 0

  if all_reduce_alg:
Contributor:

This should be an enum.

Contributor (Author):

(Replied in another comment)
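If the flag were restricted as suggested, an enum-style definition with absl.flags might look like this (hypothetical; the author argues below against limiting the choices):

from absl import flags

flags.DEFINE_enum(
    name="all_reduce_alg", default=None,
    enum_values=["nccl", "hierarchical_copy"],
    help="Algorithm to use when performing all-reduce across GPUs. If left "
         "unset, DistributionStrategy picks one based on the device topology.")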

  return False


def generate_synthetic_data(
Contributor:

Why does this implementation differ from the current synthetic data approach? And could you document why it is better than just the naive .from_tensor_slices(...).repeat(...)?

Contributor (Author):

I think I'm missing something. The implementation is the same as the current synthetic data approach, except that it allows the input/label shapes to be nested.
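For readers following along, a minimal sketch of such a helper, assuming TensorFlow's tf.nest utilities; the names and defaults here are illustrative rather than the exact merged code:

import tensorflow as tf


def generate_synthetic_data(input_shape, input_value=0, input_dtype=None,
                            label_shape=None, label_value=0, label_dtype=None):
  """Creates a repeating dataset of constant tensors (sketch only).

  input_shape and label_shape may each be a single tf.TensorShape or a nested
  structure of them, so models with multiple inputs can reuse the helper.
  """
  inputs = tf.nest.map_structure(
      lambda shape: tf.constant(input_value, input_dtype, shape), input_shape)
  element = inputs
  if label_shape is not None:
    labels = tf.nest.map_structure(
        lambda shape: tf.constant(label_value, label_dtype, shape), label_shape)
    element = (inputs, labels)
  # Infinite repeat; the step-based training schedule bounds the run.
  return tf.data.Dataset.from_tensors(element).repeat()


# Example mirroring the Transformer's (inputs, targets) excerpt above:
dataset = generate_synthetic_data(
    input_shape=tf.TensorShape([16, 64]), input_value=1, input_dtype=tf.int64,
    label_shape=tf.TensorShape([16, 64]), label_value=1, label_dtype=tf.int64)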

Contributor:

I think I'm likely the one missing something.

@karmel (Contributor) left a comment:

One nit, but other than that, looks good. Thanks!

  Args:
    num_gpus: Number of GPUs to run this model.
    all_reduce_alg: Specify which algorithm to use when performing all-reduce.
Contributor:

What are the choices and the default here? It might be nice to mention looking at DistributionStrategy if more detail is desired.

Contributor:

The choices are nccl and hierarchical_copy; if not specified, DistStrat will look at the device topology and choose (hierarchical_copy if it looks like a DGX, otherwise nccl, I believe). I think we want None to be the default.

Contributor (Author):

A concern I have with limiting the choices here is that more algorithms may be implemented in the future (and it might be a hassle to update our code each time). I'll put a mention of DistStrat here.

@k-w-w merged commit 29c9f98 into master Jun 12, 2018
@k-w-w deleted the t-multi branch June 12, 2018 16:54
@venuswu commented Aug 25, 2019

> Hi all, I added DistributionStrategy to the Transformer model. Currently, the model isn't running very well with MirroredStrategy and I'm not sure why. @robieta @guptapriya As people familiar with DistributionStrategy, please help!
>
> Current stats:
>
> GPUs     Global steps/sec     Batch size (per device)
> 1 GPU    1.11                 4096
> 4 GPU    0.34                 3072
>
> I decreased the batch size because of OOM errors.

I have run the Transformer, and it seems to be very slow. @robieta
