New feature: Large Model Support contrib module for training large models #19845
Conversation
Co-authored-by: Samuel D. Matzek <smatzek@us.ibm.com>
tensorflow/contrib/BUILD
Outdated
@@ -65,6 +65,7 @@ py_library(
     "//tensorflow/contrib/linalg:linalg_py",
     "//tensorflow/contrib/linear_optimizer:sdca_estimator_py",
     "//tensorflow/contrib/linear_optimizer:sdca_ops_py",
+    "//tensorflow/contrib/lms:lms_py",
Make consistent indent?
I don't see any incorrect indentation when opening the file in emacs or vim. Could you please confirm again?
Oh, I see. Thanks.
I think the indentation was done with a tab, while all the other lines use spaces. It should probably be spaces for consistency.
@wdirons thanks. I will check and change to spaces.
_ub_ :: Upper bound value for LMS. Default `10000`.

_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve performance. Default `False`.
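The idea behind `fuse_swapins` can be pictured with a small pure-Python sketch. This is an illustration only, not the contrib module's actual implementation; the function name and `gap` parameter are invented here. Swap-in triggers whose positions in the execution order are close together get merged, so one swap-in serves a whole cluster of nearby consumers:

```python
# Toy sketch of fusing "close" swap-in operations (names and the `gap`
# parameter are invented for this illustration; this is not contrib/lms
# code). Trigger positions within `gap` steps of each other are merged
# into one cluster, served by a single fused swap-in.

def fuse_close_swapins(positions, gap=2):
    """Group sorted swap-in trigger positions into fused clusters."""
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= gap:
            clusters[-1].append(p)   # close enough: fuse into current cluster
        else:
            clusters.append([p])     # too far apart: start a new swap-in
    return clusters

print(fuse_close_swapins([3, 4, 10, 11, 20]))  # → [[3, 4], [10, 11], [20]]
```

This also makes the trade-off discussed in the review visible: fused swap-ins mean fewer transfers, but the tensor must stay resident on the GPU across the whole cluster.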
If it improves performance, why is it `False` by default? Any side effects?
Performance may be improved, but the maximum batch size we are able to train is decreased, because tensors are kept in GPU memory for a longer time so they can be reused by operations.
OK, this is a pretty large CL. It will take some time to think about this. One quick high-level comment: "lms" should be expanded to something like large_model to make it more readable.
@yuefengz WDYT?
@yuefengz @benoitsteiner could you comment on this?
@allenlavoie Allen, could you take a look?
One high-level question is how this relates to Grappler's memory optimizer. There is a swapping heuristic which is on by default.
In general we've been leaning toward optimizing things by default when possible rather than providing opt-in utilities. Is there something here we could merge into Grappler to reach more people without requiring them to configure it? (But if there are optimizations too experimental/aggressive to turn on by default, adding an option to RewriterConfig is possible.) Having rewrites in Python is also going to limit their impact significantly. Grappler gets run in quite a few places, for example when defining graph functions while executing eagerly.

The second issue is that contrib is going away, and we're trying to reduce rather than increase the number of contrib projects. @martinwicke would have a bit more context on this, but I think the short summary is that we'd prefer things that can't/won't be merged into core to live in their own repos. But if these rewrites could be contributed to Grappler, maybe the documentation/examples could live with Grappler too (rather than in a contrib/ directory)?
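For readers unfamiliar with the RewriterConfig option mentioned here, a hedged TF 1.x sketch of selecting Grappler's memory optimizer (exact proto fields may differ across TensorFlow versions; this is a configuration fragment, not code from this PR):

```python
# Hedged sketch (TF 1.x API; field names may vary by version): requesting
# Grappler's swapping-based memory optimizer via RewriterConfig.
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

rewrite_options = rewriter_config_pb2.RewriterConfig(
    memory_optimization=rewriter_config_pb2.RewriterConfig.SWAPPING_HEURISTICS)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
config = tf.ConfigProto(graph_options=graph_options)
# with tf.Session(config=config) as sess:
#     ...
```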
I am sorry this was left lingering for so long, but we will not accept new projects to contrib (see also github.com/tensorflow/community/pull/18). We would prefer this be maintained in its own repo, or merged into grappler. The latter is decidedly preferred since it'll make sure this gets used as much as possible.
Thanks to @allenlavoie and @martinwicke for your comments!
The use of Power seems really interesting. AFAIK, Power employs NVLink between CPU and GPU, so swapping is significantly faster than over PCIe. Could you shed some light on its performance characteristics, as most TF users are not familiar with the Power platform? We would be interested in taking over this project if we are convinced that, combined with the alternative hardware platform, it could be more valuable to TF users.
@martinwicke @allenlavoie The graph modifications with this are being done statically before the model is run in a session. These graph modifications could likely be done at the Grappler level, but we would probably want them turned off by default and have the modification tunables (number of tensors to swap, how soon to trigger the swap-ins, etc.) be set by the RewriterConfig. They were initially written at the Python level to allow faster prototyping, experimentation, and research. In practice we've seen the swapping accomplished by this module far outperform the swapping that was in the memory optimizer in Grappler as of TF 1.8.

Using TensorFlow High Performance Models (HPM) we were able to measure the memory gains in terms of both batch size and image size. With ResNet-50 and ResNet-152 we are able to train with 5x and 4.6x the batch size before running out of memory. We also modified GoogLeNet in HPM to allow the image resolution to be changed, and were able to train with 2.5x higher image resolutions before going OOM. Using a 3DUnet model for 3D image segmentation we were able to achieve 2.4x the 3D image resolution.

To understand the benefit of moving these modifications to the memory optimizer in Grappler, it would be helpful to know the future role of Grappler. TensorFlow 2.0 will likely have eager execution enabled by default. In such a mode, the optimizations in Grappler are N/A, correct? Is Grappler's role in 2.0 going to be limited to "production" runs where eager is turned off and the graph is available? What is the future direction for the other optimizations in Grappler given the general unavailability of the graph in eager mode?

As for @byronyi's questions about the POWER architecture: yes, it does have NVIDIA NVLink connections between the CPU and GPU, as compared to the PCIe Gen3 CPU-GPU connections that other architectures have. The POWER architecture also has a much faster bus between system memory and the CPU. The combination of these faster buses allows this type of tensor swapping to run with far less overhead than on PCIe-connected GPUs. A case study investigating this exists if you want more information about the model accuracy gains this tensor swapping produces with 3D MRIs and how it performs on different architectures.
Grappler optimizations will be available in 2.0 for any code which is inside a function (decorated with …).
Nagging Assignee @drpngx: It has been 44 days with no activity and this issue has an assignee. Please update the label and/or status accordingly. |
The TensorFlow Large Model Support contribution has been changed to a separate module and placed in its own GitHub repository: https://github.com/IBM/tensorflow-large-model-support
Nice! Thank you for updating.
…tensorflow#19845). Thanks to Samuel D. Matzek from the PowerAI team for refactoring the code.
This PR proposes a new module, named `lms`, in `contrib`, which helps TensorFlow with training large models that cannot fit into GPU memory.

The input is a computational graph defined by users, and our module automatically adds swap-in and swap-out nodes to the graph for transferring tensors from GPUs to the host and vice versa. The computational graph is modified statically, so the modification needs to be done before a TensorFlow session actually starts.
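The static rewrite can be pictured with a small pure-Python sketch (a toy model only; the node and function names are invented here and this is not the PR's implementation): for every edge whose producer and consumer are far apart in execution order, a swap-out/swap-in pair is spliced in, so the tensor lives in host memory in between.

```python
# Toy sketch of static swap-node insertion (invented names; not the PR's
# actual code). A graph is a dict mapping each node to its list of input
# nodes; `order` gives the topological execution order.

def insert_swap_nodes(graph, order, threshold=2):
    """Splice swap_out/swap_in nodes onto edges whose producer and
    consumer are more than `threshold` steps apart in execution order."""
    pos = {name: i for i, name in enumerate(order)}
    rewritten = {name: list(inputs) for name, inputs in graph.items()}
    for consumer, inputs in graph.items():
        for k, producer in enumerate(inputs):
            if pos[consumer] - pos[producer] > threshold:
                swap_out = "swap_out_%s" % producer
                swap_in = "swap_in_%s_%s" % (producer, consumer)
                rewritten.setdefault(swap_out, [producer])  # GPU -> host
                rewritten[swap_in] = [swap_out]             # host -> GPU
                rewritten[consumer][k] = swap_in            # reroute the edge
    return rewritten

# A forward chain whose first activation is reused much later (e.g. by the
# backward pass), so its edge to "grad" gets a swap pair spliced in.
g = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "grad": ["d", "a"]}
rewritten = insert_swap_nodes(g, ["a", "b", "c", "d", "grad"])
print(rewritten["grad"])  # → ['d', 'swap_in_a_grad']
```

In the real module the decision of which tensors to swap and when to trigger the swap-ins is governed by the tunables described above (`ub`, `fuse_swapins`, etc.); the sketch only shows the edge-rerouting idea.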
With this PR and a Power machine coupled with a P100 NVIDIA GPU (16 GB memory), we are able to train ResNet-50 with a mini-batch size of 800 (~4x larger than without this PR) and 3DUnet with full image sizes (192^3 images). Performance degradation is small, ranging from 10% to 30% depending on the neural network and mini-batch size.