
New feature: Large Model Support contrib module for training large models #19845

Closed
wants to merge 7 commits

Conversation

@tungld commented Jun 8, 2018

This PR proposes a new module, named `lms`, in contrib, which helps TensorFlow train large models that cannot fit into GPU memory.

The input is a computational graph defined by the user, and our module automatically adds swap-out and swap-in nodes to the graph for transferring tensors from the GPU to the host and vice versa. The graph is modified statically, so the rewrite must be done before a TensorFlow session actually starts.
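
To make the workflow concrete, here is a usage sketch. The import path follows this PR's contrib location (`//tensorflow/contrib/lms`), but the `LMS` entry point, its constructor, and the `optimizer_scopes` argument are illustrative assumptions, not the definitive API:

```python
# Hypothetical usage sketch (TF 1.x). The LMS constructor and its
# arguments are assumptions based on the PR description, not the exact API.
import tensorflow as tf
from tensorflow.contrib.lms import LMS  # assumed entry point

x = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
logits = tf.layers.dense(tf.layers.flatten(x), 1000)
loss = tf.reduce_mean(logits)
with tf.name_scope('sgd_optimizer'):
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Rewrite the graph statically: add swap-out/swap-in nodes around large
# tensors. This must happen before any tf.Session is created.
lms = LMS(optimizer_scopes={'sgd_optimizer'})  # assumed signature
lms.run(graph=tf.get_default_graph())

with tf.Session() as sess:  # the session now runs the modified graph
    sess.run(tf.global_variables_initializer())
```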

With this PR and a Power machine with a P100 NVIDIA GPU (16 GB memory), we are able to train ResNet-50 with a mini-batch size of 800 (~4x larger than without this PR) and 3DUnet with full-size images (192^3). Performance degradation is small, ranging from 10% to 30% depending on the neural network and mini-batch size.

tungld and others added 3 commits June 6, 2018 01:16
Co-authored-by: Samuel D. Matzek <smatzek@us.ibm.com>
@@ -65,6 +65,7 @@ py_library(
"//tensorflow/contrib/linalg:linalg_py",
"//tensorflow/contrib/linear_optimizer:sdca_estimator_py",
"//tensorflow/contrib/linear_optimizer:sdca_ops_py",
"//tensorflow/contrib/lms:lms_py",
Contributor

Make the indent consistent?

Author

I don't see any indentation problem when opening the file in emacs or vim. Could you please confirm again?

Contributor

Oh, I see. Thanks.

Contributor

I think the indent was done with a tab, while all the other lines use spaces. It should probably be spaces for consistency.

Author

@wdirons thanks. I will check and change to spaces.


_ub_ :: Upper bound value for LMS. Default `10000`.

_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve performance. Default `False`.
Contributor

If it improves performance, why is it `False` by default? Any side effects?

Author

Performance may improve, but the maximum batch size we can train decreases, because tensors are kept in GPU memory for a longer time so that operations can reuse them.
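
To make the tradeoff concrete, here is a sketch reusing the hypothetical constructor from the earlier example; only `ub` and `fuse_swapins` are names documented in this PR, the rest is illustrative:

```python
# Sketch only: `ub` and `fuse_swapins` are documented tunables of this
# module; the surrounding constructor shape is an assumption.

# Favor throughput: fuse nearby swap-ins into one operation. Tensors stay
# in GPU memory longer, so the maximum trainable batch size shrinks.
lms_fast = LMS(optimizer_scopes={'sgd_optimizer'}, fuse_swapins=True)

# Favor memory: keep the defaults, releasing swapped-out tensors from GPU
# memory as early as possible to maximize batch size.
lms_mem = LMS(optimizer_scopes={'sgd_optimizer'}, ub=10000, fuse_swapins=False)
```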

@drpngx requested a review from yuefengz, June 20, 2018 02:42
@drpngx (Contributor) commented Jun 20, 2018

OK, this is a pretty large CL. It will take some time to think about this.

One quick high-level comment: "lms" should be expanded to something like large_model to make it more readable.

@drpngx added the "awaiting review" label, Jun 20, 2018
@tensorflowbutler removed the "awaiting review" label, Jul 17, 2018
@drpngx (Contributor) commented Jul 31, 2018

@yuefengz WDYT?

@drpngx added the "awaiting review" label, Aug 10, 2018
@drpngx (Contributor) commented Aug 10, 2018

@yuefengz @benoitsteiner could you comment on this?

@yuefengz requested a review from allenlavoie and removed the request for benoitsteiner, September 11, 2018 17:04
@yuefengz (Contributor)

@allenlavoie Allen, could you take a look?

@allenlavoie (Member)

One high-level question is how this relates to Grappler's memory optimizer. There is a swapping heuristic which is on by default:

bool SwappingPass(RewriterConfig::MemOptType optimization_level,

In general we've been leaning toward optimizing things by default when possible rather than providing opt-in utilities. Is there something here we could merge into Grappler to reach more people without requiring them to configure it? (But if there are optimizations too experimental/aggressive to turn on by default, adding an option to RewriterConfig is possible.)
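
For reference, in TF 1.x the memory optimizer's swapping heuristics can be requested explicitly through RewriterConfig (standard proto API, independent of this PR):

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Ask Grappler's memory optimizer to use its swapping heuristics.
rewrite_options = rewriter_config_pb2.RewriterConfig(
    memory_optimization=rewriter_config_pb2.RewriterConfig.SWAPPING_HEURISTICS)
config = tf.ConfigProto(
    graph_options=tf.GraphOptions(rewrite_options=rewrite_options))

with tf.Session(config=config) as sess:
    pass  # Grappler rewrites the graph before the first run
```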

Having rewrites in Python is also going to limit their impact significantly. Grappler gets run in quite a few places, for example when defining graph functions while executing eagerly.

The second issue is that contrib is going away, and we're trying to reduce rather than increase the number of contrib projects. @martinwicke would have a bit more context on this, but I think the short summary is that we'd prefer things that can't/won't be merged into core to live in their own repos. But if these rewrites could be contributed to Grappler, maybe the documentation/examples could live with Grappler too (rather than in a contrib/ directory)?

@martinwicke (Member)

I am sorry this was left lingering for so long, but we will not accept new projects to contrib (see also github.com/tensorflow/community/pull/18).

We would prefer this be maintained in its own repo, or merged into grappler. The latter is decidedly preferred since it'll make sure this gets used as much as possible.

@tensorflowbutler removed the "awaiting review" label, Sep 12, 2018
@tungld (Author) commented Sep 19, 2018

Thanks to @allenlavoie and @martinwicke for your comments!
We understand the situation. We will move this PR to our own repository.
The ideas in this PR could be merged into Grappler, but at the moment we don't have enough time to do so.

@byronyi (Contributor) commented Sep 19, 2018

The use of Power seems really interesting. AFAIK, Power employs NVLink between the CPU and GPU, so swapping is significantly faster than over PCIe. Could you shed some light on its performance characteristics, since most TF users are not familiar with the Power platform? We would be interested in taking over this project if we are convinced that, combined with this alternative hardware platform, it could be more valuable to TF users.

@smatzek (Contributor) commented Sep 19, 2018

@martinwicke @allenlavoie The graph modifications here are done statically, before the model is run in a session. They could likely be done at the Grappler level, but we would probably want them turned off by default and have the tunables (number of tensors to swap, how soon to trigger the swap-ins, etc.) set via the RewriterConfig. They were initially written at the Python level to allow faster prototyping, experimentation, and research.

In practice we've seen the swapping done by this module far outperform the swapping in Grappler's memory optimizer as of TF 1.8. Using TensorFlow High Performance Models (HPM) we measured the memory gains in terms of both batch size and image size. With ResNet-50 and ResNet-152 we were able to train with 5x and 4.6x the batch size before running out of memory. We also modified GoogLeNet in HPM to allow the image resolution to be changed, and were able to train with 2.5x higher image resolution before going OOM. Using a 3DUnet model for 3D image segmentation, we achieved 2.4x the 3D image resolution.

To understand the benefit of moving these modifications to the memory_optimizer in Grappler it would be helpful to know the future role of grappler. TensorFlow 2.0 will likely have eager execution enabled by default. In such a mode, the optimizations in Grappler are N/A, correct? Is Grappler's role in 2.0 going to be limited to "production" runs where eager is turned off and the graph is available? What is the future direction for the other optimizations in Grappler given the general unavailability of the graph in eager mode?

As for @byronyi's questions about the POWER architecture: yes, it has NVIDIA NVLink connections between the CPU and GPU, compared to the PCIe Gen3 CPU-GPU connections that other architectures have. POWER also has a much faster bus between system memory and the CPU. The combination of these faster buses allows this type of tensor swapping to run with far less overhead than on PCIe-connected GPUs. There is a case study investigating this if you want more information about the model accuracy gains this tensor swapping produces with 3D MRIs, and how it performs on different architectures:

https://developer.ibm.com/linuxonpower/2018/07/27/tensorflow-large-model-support-case-study-3d-image-segmentation/

@martinwicke (Member)

Grappler optimizations will be available in 2.0 for any code which is inside a function (decorated with @defun). We believe that will be most code, so graph-level optimizations will definitely be a thing.
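
In TF 2.x terms that corresponds to `tf.function`, the public successor of `defun`; a minimal illustration:

```python
import tensorflow as tf

@tf.function  # traced to a graph, which Grappler then optimizes
def train_step(x, w):
    return tf.matmul(x, w)

# Ops outside the function run eagerly and skip graph-level optimization;
# the body of train_step does not.
out = train_step(tf.ones([2, 3]), tf.ones([3, 4]))
```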

@tensorflowbutler (Member)

Nagging Assignee @drpngx: It has been 44 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@smatzek (Contributor) commented Nov 6, 2018

The TensorFlow Large Model Support contribution has been converted to a standalone module and placed in its own GitHub repository: https://github.com/IBM/tensorflow-large-model-support

@drpngx (Contributor) commented Nov 6, 2018

Nice! Thank you for updating.

@drpngx closed this Nov 6, 2018
tungld added a commit to IBM/tensorflow-large-model-support that referenced this pull request Apr 22, 2020
…tensorflow#19845). Thank Samuel D. Matzek from PowerAI team for refactoring the code.