New feature: Large Model Support contrib module for training large models #19845

Closed
wants to merge 7 commits into base: master

Conversation

@tungld

tungld commented Jun 8, 2018

This PR proposes a new module in contrib, named `lms`, which helps TensorFlow train large models that cannot fit into GPU memory.

The input is a computational graph defined by the user; the module automatically adds swap-out and swap-in nodes to the graph to transfer tensors from the GPU to the host and back. The graph is modified statically, so the rewrite must be done before a TensorFlow session actually starts.
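The rewrite idea can be pictured with a toy graph structure. This is an illustrative sketch only, not the module's actual implementation; the `Node` class and the `swap_out`/`swap_in` names are invented for illustration:

```python
# Illustrative sketch of the static rewrite: for a tensor that is
# produced on the GPU but consumed much later (e.g. an activation
# reused in the backward pass), reroute the consumer through a
# swap-out node (GPU -> host) and a swap-in node (host -> GPU).
# All names here are invented for illustration.

class Node:
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)

def insert_swap(producer, consumer):
    """Reroute consumer's use of producer through swap-out/swap-in nodes."""
    swap_out = Node(producer.name + "/swap_out", inputs=[producer])  # GPU -> host
    swap_in = Node(producer.name + "/swap_in", inputs=[swap_out])    # host -> GPU
    consumer.inputs = [swap_in if n is producer else n for n in consumer.inputs]
    return swap_out, swap_in

# Toy graph: the gradient op consumes conv1's output much later.
conv1 = Node("conv1")
grad = Node("conv1_grad", inputs=[conv1])
insert_swap(conv1, grad)
print([n.name for n in grad.inputs])  # ['conv1/swap_in']
```

Because the modification is purely structural, it can be applied to the graph definition once, up front, before any session executes it.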

With this PR, on a Power machine with a P100 NVIDIA GPU (16 GB memory), we are able to train ResNet-50 with a mini-batch size of 800 (~4x larger than without this PR) and 3DUnet with full-size images (192^3). The performance degradation is small, ranging from 10% to 30% depending on the neural network and the mini-batch size used for training.

tungld and others added some commits Jun 5, 2018

Implement Large Model Support for training large models
Co-authored-by: Samuel D. Matzek <smatzek@us.ibm.com>
Edit README
Co-authored-by: Samuel D. Matzek <smatzek@us.ibm.com>
Add a callback for Keras
Co-authored-by: Samuel D. Matzek <smatzek@us.ibm.com>
@@ -65,6 +65,7 @@ py_library(
"//tensorflow/contrib/linalg:linalg_py",
"//tensorflow/contrib/linear_optimizer:sdca_estimator_py",
"//tensorflow/contrib/linear_optimizer:sdca_ops_py",
"//tensorflow/contrib/lms:lms_py",

@viirya

viirya Jun 19, 2018

Contributor

Make the indentation consistent?

@tungld

tungld Jun 20, 2018

I don't see the indentation going wrong when opening the file in emacs or vim. Could you please confirm again?

@viirya

viirya Jun 20, 2018

Contributor

Oh, I see. Thanks.

@wdirons

wdirons Jun 21, 2018

Contributor

I think the indent was done via a tab, while all the other lines were done via spaces. It probably should be spaces to be consistent.

@tungld

tungld Jun 27, 2018

@wdirons thanks. I will check and change to spaces.

_ub_ :: Upper bound value for LMS. Default `10000`.
_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve performance. Default `False`.

@viirya

viirya Jun 19, 2018

Contributor

If it improves the performance, why False by default? Any side effect?

@tungld

tungld Jun 20, 2018

Performance may improve, but the maximum batch size we are able to train decreases, because tensors are kept on the GPU for a longer time so that operations can reuse them.
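The tradeoff just described can be made concrete with a small container for the tuneables. This class is invented for illustration and is not part of the module's API; only the field names and defaults mirror the README diff above:

```python
from dataclasses import dataclass

@dataclass
class LMSParams:
    """Illustrative container for the tuneables discussed above.

    The field names and defaults mirror the README diff in this PR;
    the class itself is invented for illustration, not the module's API.
    """
    ub: int = 10000             # upper bound value for LMS
    fuse_swapins: bool = False  # fusing "close" swap-ins may speed things up,
                                # but keeps tensors on the GPU longer, which
                                # reduces the maximum trainable batch size

params = LMSParams()
print(params.ub, params.fuse_swapins)  # 10000 False
```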

@drpngx drpngx requested a review from yuefengz Jun 20, 2018

@drpngx

Member

drpngx commented Jun 20, 2018

OK, this is a pretty large CL. It will take some time to think about this.

One quick high-level comment: "lms" should be expanded to something like large_model to make it more readable.

@drpngx

Member

drpngx commented Jul 31, 2018

@yuefengz WDYT?

@yuefengz yuefengz requested a review from benoitsteiner Jul 31, 2018

@drpngx

Member

drpngx commented Aug 10, 2018

@yuefengz @benoitsteiner could you comment on this?

@yuefengz yuefengz requested review from allenlavoie and removed request for benoitsteiner Sep 11, 2018

@yuefengz

Member

yuefengz commented Sep 11, 2018

@allenlavoie Allen, could you take a look?

@allenlavoie

Member

allenlavoie commented Sep 11, 2018

One high-level question is how this relates to Grappler's memory optimizer. There is a swapping heuristic which is on by default:

bool SwappingPass(RewriterConfig::MemOptType optimization_level,

In general we've been leaning toward optimizing things by default when possible rather than providing opt-in utilities. Is there something here we could merge into Grappler to reach more people without requiring them to configure it? (But if there are optimizations too experimental/aggressive to turn on by default, adding an option to RewriterConfig is possible.)

Having rewrites in Python is also going to limit their impact significantly. Grappler gets run in quite a few places, for example when defining graph functions while executing eagerly.

The second issue is that contrib is going away, and we're trying to reduce rather than increase the number of contrib projects. @martinwicke would have a bit more context on this, but I think the short summary is that we'd prefer things that can't/won't be merged into core to live in their own repos. But if these rewrites could be contributed to Grappler, maybe the documentation/examples could live with Grappler too (rather than in a contrib/ directory)?
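For reference, the RewriterConfig opt-in path mentioned above looks roughly like the following in TF 1.x. Treat this as a sketch of the mechanism rather than a recommended setting; check the `rewriter_config` proto for the exact enum values in your TF version:

```python
# Sketch of opting in to Grappler's swapping memory optimizer via
# RewriterConfig in TF 1.x. Illustrative configuration fragment.
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

rewrite_options = rewriter_config_pb2.RewriterConfig(
    memory_optimization=rewriter_config_pb2.RewriterConfig.SWAPPING_HEURISTICS)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
config = tf.ConfigProto(graph_options=graph_options)

with tf.Session(config=config) as sess:
    ...  # build and run the model; Grappler applies the swapping pass
```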

@martinwicke

Member

martinwicke commented Sep 11, 2018

I am sorry this was left lingering for so long, but we will not accept new projects to contrib (see also github.com/tensorflow/community/pull/18).

We would prefer this be maintained in its own repo, or merged into grappler. The latter is decidedly preferred since it'll make sure this gets used as much as possible.

@tungld

tungld commented Sep 19, 2018

Thanks @allenlavoie and @martinwicke for your comments!
We understand the situation and will move this PR to our own repository.
The ideas in the PR could be merged into Grappler, but at the moment we don't have enough time to do so.

@byronyi

Contributor

byronyi commented Sep 19, 2018

The use of Power seems really interesting. AFAIK, Power employs NVLink between the CPU and GPU, so swapping is significantly faster than over PCIe. Could you shed some light on its performance characteristics, since most TF users are not familiar with the Power platform? We would be interested in taking over this project if convinced that it could be more valuable to TF users in combination with this alternative hardware platform.

@smatzek

Contributor

smatzek commented Sep 19, 2018

@martinwicke @allenlavoie The graph modifications with this are being done statically before the model is run in a session. These graph modifications could likely be done at the grappler level, but we would probably want them turned off by default and have the modification tuneables (number of tensors to swap, how soon to trigger the swap-ins, etc) be set by the RewriterConfig. They were initially written at the Python level to allow faster prototyping, experimentation, and research.

In practice we've seen the swapping accomplished by this module far outperform the swapping in Grappler's memory optimizer as of TF 1.8. Using TensorFlow High Performance Models (HPM) we measured the memory gains in terms of both batch size and image size. With ResNet-50 and ResNet-152 we are able to train with 5x and 4.6x the batch size, respectively, before running out of memory. We also modified GoogleNet in HPM to allow the image resolution to be changed and were able to train with 2.5x higher image resolutions before going OOM. Using a 3DUnet model for 3D image segmentation we achieved 2.4x the 3D image resolution.

To understand the benefit of moving these modifications to the memory_optimizer in Grappler it would be helpful to know the future role of grappler. TensorFlow 2.0 will likely have eager execution enabled by default. In such a mode, the optimizations in Grappler are N/A, correct? Is Grappler's role in 2.0 going to be limited to "production" runs where eager is turned off and the graph is available? What is the future direction for the other optimizations in Grappler given the general unavailability of the graph in eager mode?

As for @byronyi's questions about the POWER architecture: yes, it has NVIDIA NVLink connections between the CPU and GPU, as compared to the PCIe Gen3 CPU-GPU connections on other architectures. POWER also has a much faster bus between system memory and the CPU. The combination of these faster buses allows this type of tensor swapping to run with far less overhead than on PCIe-connected GPUs. If you want more information about the model accuracy gains this tensor swapping produces with 3D MRIs, and how it performs on different architectures, see this case study:

https://developer.ibm.com/linuxonpower/2018/07/27/tensorflow-large-model-support-case-study-3d-image-segmentation/
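To illustrate why link bandwidth dominates swapping overhead, here is a back-of-the-envelope sketch. The bandwidth figures are rough public numbers assumed for illustration (~16 GB/s for PCIe Gen3 x16, ~75 GB/s per direction for NVLink 2.0 CPU-GPU on POWER9), and latency and compute/transfer overlap are ignored:

```python
def swap_time_ms(tensor_bytes, bandwidth_gb_s):
    """Lower-bound transfer time for one swap, ignoring latency/overlap."""
    return tensor_bytes / (bandwidth_gb_s * 1e9) * 1e3

# A 256 MB activation tensor, swapped out and back in (2 transfers).
tensor = 256 * 1024 * 1024
for name, bw in [("PCIe Gen3 x16 (~16 GB/s)", 16), ("NVLink 2.0 (~75 GB/s)", 75)]:
    print(f"{name}: {2 * swap_time_ms(tensor, bw):.1f} ms round trip")
# roughly 33.6 ms over PCIe vs 7.2 ms over NVLink
```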

@martinwicke

Member

martinwicke commented Sep 19, 2018

Grappler optimizations will be available in 2.0 for any code which is inside a function (decorated with @defun). We believe that will be most code, so graph-level optimizations will definitely be a thing.

@tensorflowbutler

Member

tensorflowbutler commented Nov 3, 2018

Nagging Assignee @drpngx: It has been 44 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@smatzek

Contributor

smatzek commented Nov 6, 2018

The TensorFlow Large Model Support contribution has been changed to a separate module and placed in its own GitHub repository: https://github.com/IBM/tensorflow-large-model-support

@drpngx

Member

drpngx commented Nov 6, 2018

Nice! Thank you for updating.

@drpngx drpngx closed this Nov 6, 2018
