Transformer multi gpu #4457
Conversation
I assume this works; have you validated that it does for CPU, 1-GPU, and multi-GPU?
  else:
    params["batch_size"] = distribution_utils.per_device_batch_size(
        flags_obj.batch_size or params["default_batch_size"],
        flags_core.get_num_gpus(flags_obj))
nit: cleaner:
params["batch_size"] = flags_obj.batch_size or params["default_batch_size_tpu"]
if not params["use_tpu"]:
params["batch_size"] = distribution_utils.per_device_batch_size(
params["batch_size"], flags_core.get_num_gpus(flags_obj))
# limitations under the License.
# ==============================================================================
"""Helper functions for running models in a distributed setting."""
Note to selves: Build files will be required for this.
  remainder = batch_size % num_gpus
  if remainder:
    err = ('When running with multiple GPUs, batch size '
nit: double quotes
FYI: training blows up with synthetic data if the data_dir doesn't exist, even though it isn't used. Edit (kathy): can't reply here, but this has been fixed; a typo prevented the synthetic data flag from being seen.
This should be ready for review. Thanks @guptapriya, @robieta, and @yuefengz for the DistributionStrategy help! We've determined that the embedding is slowing the model down, and a feature request has been made to the DistStrat team. With the hierarchical all_reduce setting, the speeds are more reasonable.
Thanks for adding distribution_utils
official/resnet/resnet_run_loop.py
Outdated
      num_gpus=flags_core.get_num_gpus(flags_obj)
  )
  distribution_strategy = distribution_utils.get_distribution_strategy(
      flags_core.get_num_gpus(flags_obj), use_hierarchical_copy=True)
I think it would be better if the all_reduce algorithm was a performance flag since the optimal one will vary by hardware. For ResNet in particular, DistStrat auto selection should be fine in a week or so.
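For illustration, a minimal sketch of what such a performance flag could look like, assuming absl flags are used (the flag name and choices here are illustrative, not part of this PR):

from absl import flags

flags.DEFINE_enum(
    name="all_reduce_alg", default=None,
    enum_values=["nccl", "hierarchical_copy"],
    help="Algorithm MirroredStrategy should use for all-reduce. If unset, "
         "DistributionStrategy inspects the device topology and chooses one.")

Leaving the default at None would keep the current auto-selection behavior unless the user opts in.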
dataset = tf.data.Dataset.from_tensors(tf.ones([batch, length], tf.int64))
dataset = dataset.map(lambda x: (x, x))
dataset = dataset.cache()
dataset = dataset.repeat(1000)
I would prefer that we don't hard code a dummy size into the synthetic data. Because there is already support for setting a step based schedule, infinite repeat should be fine. (This is what ResNet does.)
See similar message in #4476 , but we should make a helper fn for getting synthetic data in a particular shape... don't both do that, though.
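A minimal sketch of such a helper, covering both points above (the name and signature are assumptions, not from this PR): it repeats a single all-ones (features, labels) element indefinitely, so the step-based schedule decides when training stops.

import tensorflow as tf

def synthetic_dataset(shape, dtype=tf.int64):
  # Build one all-ones tensor of the requested shape and reuse it as both the
  # input and the label, repeating forever like the ResNet synthetic input.
  x = tf.ones(shape, dtype=dtype)
  return tf.data.Dataset.from_tensors((x, x)).cache().repeat()

# e.g. dataset = synthetic_dataset([batch, length]) would replace the four
# lines quoted above.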
"base": model_params.BASE_PARAMS, | ||
"base_multi_gpu": model_params.BASE_MULTI_GPU_PARAMS, | ||
"big": model_params.BIG_PARAMS, | ||
"big_multi_gpu": model_params.BIG_MULTI_GPU_PARAMS, |
I feel like it might be better to exclude these from choices, and hot-swap them out if there are multiple GPUs, logging a message so that people know it happened. Thoughts?
Yeah, I was conflicted; swapping the parameters and creating a separate set of params seemed about equally good. Usability-wise, I prefer swapping the params in the multi-GPU case so the run commands don't have to change as much. I ended up choosing the other option to keep models "mathematically equivalent" when moving to multiple GPUs (similar to how the batch size is global instead of per-device).
I don't have a strong preference. Do you still think we should go with swapping the params?
I was thinking have both sets of params, but don't include the multi-GPU in the mapping dict. Then, if multi GPU, swap the entire param set up front.
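Something like the following sketch, where the multi-GPU variants stay out of the user-facing choices (the module path, flag name, and logging call are assumptions for illustration):

import tensorflow as tf
from official.transformer.model import model_params

PARAMS_MAP = {
    "base": model_params.BASE_PARAMS,
    "big": model_params.BIG_PARAMS,
}
MULTI_GPU_PARAMS_MAP = {
    "base": model_params.BASE_MULTI_GPU_PARAMS,
    "big": model_params.BIG_MULTI_GPU_PARAMS,
}

def select_params(param_set, num_gpus):
  # Swap the entire param set up front and tell the user it happened.
  if num_gpus > 1:
    tf.logging.info("Multiple GPUs detected: using the %s multi-GPU "
                    "parameter set.", param_set)
    return MULTI_GPU_PARAMS_MAP[param_set]
  return PARAMS_MAP[param_set]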
"""Add flags and flag validators for running transformer_main.""" | ||
# Add common flags (data_dir, model_dir, train_epochs, etc.). | ||
flags_core.define_base(multi_gpu=False, num_gpu=False) | ||
flags_core.define_base(multi_gpu=False) |
I don't think we need the multi_gpu=False here, and it is deceptive, so better to remove?
Perhaps we should remove the multi_gpu flag and use num_gpu instead? MNIST uses multi_gpu but not num_gpu, and the ResNet models use num_gpu but not multi_gpu.
I have a question/suggestion about the num_gpu flag (+@robieta about this). Currently, the default value of num_gpu is 0 or 1 depending on whether there is a GPU. Can we change this default to None instead?
OneDeviceStrategy requires more memory than having no DistributionStrategy (batch size of 4096 causes OOM errors when using OneDeviceStrategy). When the user doesn't specify num_gpus, we should default to using no DistributionStrategy.
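A sketch of the behavior being proposed, assuming the TF 1.x tf.contrib.distribute API (the function name and device strings are illustrative):

import tensorflow as tf

def maybe_get_distribution_strategy(num_gpus):
  # num_gpus is None when the user did not pass --num_gpus at all.
  if num_gpus is None:
    return None  # no DistributionStrategy; avoids OneDeviceStrategy's memory overhead
  if num_gpus == 0:
    return tf.contrib.distribute.OneDeviceStrategy("device:CPU:0")
  if num_gpus == 1:
    return tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")
  return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)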
Hm. Can you ping DistStrat and see if this is expected, and what the advantages of using OneDevice are/will be?
Hmm, I tried running it again and it went through without failing. Maybe Transformer got unlucky with the dynamic batching? For now, I'll leave it as is.
What are your thoughts about the num_gpu vs multi_gpu flags?
We want to get everything on num_gpus and remove multi_gpu altogether.
import tensorflow as tf


def get_distribution_strategy(num_gpus, use_hierarchical_copy=False):
args, returns? In particular, a description of hierarchical_copy would be good.
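For instance, a possible docstring (the wording is only a suggestion, not the final text of this PR):

def get_distribution_strategy(num_gpus, use_hierarchical_copy=False):
  """Returns a DistributionStrategy to use for the given number of GPUs.

  Args:
    num_gpus: Number of GPUs to run the model on; 0 runs on CPU, 1 uses a
      single device, and values greater than 1 mirror the model across GPUs.
    use_hierarchical_copy: If True, MirroredStrategy performs its all-reduce
      with the hierarchical_copy algorithm (reduce through a staging device,
      typically faster on DGX-1-like topologies) instead of NCCL.

  Returns:
    A DistributionStrategy object, or None if no strategy is needed.
  """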
That's a good point. I'll send them a message.
…ause params didn't have the TPU parameters.)
Just a couple minor things, but overall LGTM.
return loss_scale > 0


if all_reduce_alg:
This should be an enum.
(Replied in another comment)
return False


def generate_synthetic_data(
Why does this implementation differ from the current synthetic data approach? And could you document why it is better than just the naive .from_tensor_slices(...).repeat(...)?
I think I'm missing something. The implementation is the same as the current synthetic data approach, except that it allows the input/label shapes to be nested.
I think I'm likely the one missing something.
One nit, but other than that, looks good, thanks
  Args:
    num_gpus: Number of GPUs to run this model.
    all_reduce_alg: Specify which algorithm to use when performing all-reduce.
What are the choices and default here? Might be nice to mention going to look at DistributionStrategies if more detail is desired.
The choices are nccl or hierarchical_copy; if not specified, DistStrat will look at the device topology and choose (hierarchical_copy if it looks like a DGX, otherwise nccl, I believe). I think we want None to be the default.
A concern I have with limiting the choices here is that more algorithms may be implemented in the future (and it might be a hassle to update our code each time). I'll add a mention of DistStrat here.
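To make the auto-selection explicit, a rough sketch of how the dispatch could look, assuming the TF 1.x contrib API (tf.contrib.distribute.AllReduceCrossTowerOps; the num_packs value is illustrative):

import tensorflow as tf

def get_distribution_strategy(num_gpus, all_reduce_alg=None):
  if num_gpus <= 1:
    return None
  if all_reduce_alg:
    # Honor an explicit choice ("nccl" or "hierarchical_copy", plus whatever
    # DistStrat adds later, since the string is passed straight through).
    return tf.contrib.distribute.MirroredStrategy(
        num_gpus=num_gpus,
        cross_tower_ops=tf.contrib.distribute.AllReduceCrossTowerOps(
            all_reduce_alg, num_packs=2))
  # With no algorithm specified, DistStrat inspects the device topology and
  # picks one itself.
  return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)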
I have run the Transformer, and it seems to be very slow. @robieta
Hi all, I added DistributionStrategy to the Transformer model. Currently, the model isn't running very well with MirroredStrategy, and I'm not sure why. @robieta @guptapriya, as people familiar with DistributionStrategy, please help!
Current stats:
I decreased the batch size because of OOM errors.