New feature: Large Model Support contrib module for training large models #19845
Conversation
Co-authored-by: Samuel D. Matzek <smatzek@us.ibm.com>
tensorflow/contrib/BUILD
Outdated
@@ -65,6 +65,7 @@ py_library(
     "//tensorflow/contrib/linalg:linalg_py",
     "//tensorflow/contrib/linear_optimizer:sdca_estimator_py",
     "//tensorflow/contrib/linear_optimizer:sdca_ops_py",
+    "//tensorflow/contrib/lms:lms_py",
Make consistent indent?
I don't see any incorrect indentation when opening the file in emacs or vim. Could you please confirm again?
Oh, I see. Thanks.
I think the indentation was done with a tab, while all the other lines use spaces. It should probably be spaces for consistency.
@wdirons thanks. I will check and change to spaces.
_ub_ :: Upper bound value for LMS. Default `10000`.

_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve performance. Default `False`.
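The idea behind `fuse_swapins` can be pictured with a small pure-Python sketch. This is an illustration only, not the contrib module's actual implementation; the function name and `gap` parameter are invented here. Swap-in triggers whose positions in the execution order are close together get merged, so one swap-in serves a whole cluster of nearby consumers:

```python
# Toy sketch of fusing "close" swap-in operations (names and the `gap`
# parameter are invented for this illustration; this is not contrib/lms
# code). Trigger positions within `gap` steps of each other are merged
# into one cluster, served by a single fused swap-in.

def fuse_close_swapins(positions, gap=2):
    """Group sorted swap-in trigger positions into fused clusters."""
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= gap:
            clusters[-1].append(p)   # close enough: fuse into current cluster
        else:
            clusters.append([p])     # too far apart: start a new swap-in
    return clusters

print(fuse_close_swapins([3, 4, 10, 11, 20]))  # → [[3, 4], [10, 11], [20]]
```

This also makes the trade-off discussed in the review visible: fused swap-ins mean fewer transfers, but the tensor must stay resident on the GPU across the whole cluster.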
If it improves performance, why is it `False` by default? Any side effects?
Performance may be improved, but the maximum batch size we are able to train is decreased, because tensors are kept in GPU memory for a longer time so they can be reused by operations.
OK, this is a pretty large CL. It will take some time to think about this. One quick high-level comment: "lms" should be expanded to something like large_model to make it more readable.
@yuefengz WDYT?
@yuefengz @benoitsteiner could you comment on this?
@allenlavoie Allen, could you take a look?
One high-level question is how this relates to Grappler's memory optimizer. There is a swapping heuristic which is on by default.
In general we've been leaning toward optimizing things by default when possible rather than providing opt-in utilities. Is there something here we could merge into Grappler to reach more people without requiring them to configure it? (But if there are optimizations too experimental/aggressive to turn on by default, adding an option to RewriterConfig is possible.) Having rewrites in Python is also going to limit their impact significantly. Grappler gets run in quite a few places, for example when defining graph functions while executing eagerly.

The second issue is that contrib is going away, and we're trying to reduce rather than increase the number of contrib projects. @martinwicke would have a bit more context on this, but I think the short summary is that we'd prefer things that can't/won't be merged into core to live in their own repos. But if these rewrites could be contributed to Grappler, maybe the documentation/examples could live with Grappler too (rather than in a contrib/ directory)?
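For readers unfamiliar with the RewriterConfig option mentioned here, a hedged TF 1.x sketch of selecting Grappler's memory optimizer (exact proto fields may differ across TensorFlow versions; this is a configuration fragment, not code from this PR):

```python
# Hedged sketch (TF 1.x API; field names may vary by version): requesting
# Grappler's swapping-based memory optimizer via RewriterConfig.
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

rewrite_options = rewriter_config_pb2.RewriterConfig(
    memory_optimization=rewriter_config_pb2.RewriterConfig.SWAPPING_HEURISTICS)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
config = tf.ConfigProto(graph_options=graph_options)
# with tf.Session(config=config) as sess:
#     ...
```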
I am sorry this was left lingering for so long, but we will not accept new projects to contrib (see also github.com/tensorflow/community/pull/18). We would prefer this be maintained in its own repo, or merged into grappler. The latter is decidedly preferred since it'll make sure this gets used as much as possible.
Thanks to @allenlavoie and @martinwicke for your comments!
The use of Power seems really interesting. AFAIK, Power employs NVLink between CPU and GPU, so swapping is significantly faster than over PCIe. Could you shed some light on its performance characteristics, as most TF users are not familiar with the Power platform? We would be interested in taking over this project if we are convinced that, combined with the alternative hardware platform, it could be more valuable to TF users.
@martinwicke @allenlavoie The graph modifications with this are being done statically before the model is run in a session. These graph modifications could likely be done at the Grappler level, but we would probably want them turned off by default and have the modification tunables (number of tensors to swap, how soon to trigger the swap-ins, etc.) be set by the RewriterConfig. They were initially written at the Python level to allow faster prototyping, experimentation, and research. In practice we've seen the swapping accomplished by this module far outperform the swapping that was in the memory optimizer in Grappler as of TF 1.8.

Using TensorFlow High Performance Models (HPM) we were able to measure the memory gains in terms of both batch size and image size. With ResNet-50 and ResNet-152 we are able to train with 5x and 4.6x the batch size before running out of memory. We also modified GoogLeNet in HPM to allow the image resolution to be changed, and were able to train with 2.5x higher image resolutions before going OOM. Using a 3DUnet model for 3D image segmentation we were able to achieve 2.4x the 3D image resolution.

To understand the benefit of moving these modifications to the memory optimizer in Grappler, it would be helpful to know the future role of Grappler. TensorFlow 2.0 will likely have eager execution enabled by default. In such a mode, the optimizations in Grappler are N/A, correct? Is Grappler's role in 2.0 going to be limited to "production" runs where eager is turned off and the graph is available? What is the future direction for the other optimizations in Grappler given the general unavailability of the graph in eager mode?

As for @byronyi's questions about the POWER architecture: yes, it does have NVIDIA NVLink connections between the CPU and GPU, as compared to the PCIe Gen3 CPU-GPU connections that other architectures have. The POWER architecture also has a much faster bus between system memory and the CPU. The combination of these faster buses allows this type of tensor swapping to run with far less overhead than on PCIe-connected GPUs. A case study investigating this exists if you want more information about the model accuracy gains this tensor swapping produces with 3D MRIs and how it performs on different architectures.
Grappler optimizations will be available in 2.0 for any code which is inside a function (decorated with …).
Nagging Assignee @drpngx: It has been 44 days with no activity and this issue has an assignee. Please update the label and/or status accordingly. |
The TensorFlow Large Model Support contribution has been changed to a separate module and placed in its own GitHub repository: https://github.com/IBM/tensorflow-large-model-support
Nice! Thank you for updating.
…tensorflow#19845). Thanks to Samuel D. Matzek from the PowerAI team for refactoring the code.
This PR proposes a new module, named `lms`, in `contrib`, which helps TensorFlow with training large models that cannot fit into GPU memory.

The input is a computational graph defined by users, and our module automatically adds swap-in and swap-out nodes to the graph for transferring tensors from GPUs to the host and vice versa. The computational graph is modified statically, so the modification needs to be done before a TensorFlow session actually starts.
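The static rewrite can be pictured with a small pure-Python sketch (a toy model only; the node and function names are invented here and this is not the PR's implementation): for every edge whose producer and consumer are far apart in execution order, a swap-out/swap-in pair is spliced in, so the tensor lives in host memory in between.

```python
# Toy sketch of static swap-node insertion (invented names; not the PR's
# actual code). A graph is a dict mapping each node to its list of input
# nodes; `order` gives the topological execution order.

def insert_swap_nodes(graph, order, threshold=2):
    """Splice swap_out/swap_in nodes onto edges whose producer and
    consumer are more than `threshold` steps apart in execution order."""
    pos = {name: i for i, name in enumerate(order)}
    rewritten = {name: list(inputs) for name, inputs in graph.items()}
    for consumer, inputs in graph.items():
        for k, producer in enumerate(inputs):
            if pos[consumer] - pos[producer] > threshold:
                swap_out = "swap_out_%s" % producer
                swap_in = "swap_in_%s_%s" % (producer, consumer)
                rewritten.setdefault(swap_out, [producer])  # GPU -> host
                rewritten[swap_in] = [swap_out]             # host -> GPU
                rewritten[consumer][k] = swap_in            # reroute the edge
    return rewritten

# A forward chain whose first activation is reused much later (e.g. by the
# backward pass), so its edge to "grad" gets a swap pair spliced in.
g = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "grad": ["d", "a"]}
rewritten = insert_swap_nodes(g, ["a", "b", "c", "d", "grad"])
print(rewritten["grad"])  # → ['d', 'swap_in_a_grad']
```

In the real module the decision of which tensors to swap and when to trigger the swap-ins is governed by the tunables described above (`ub`, `fuse_swapins`, etc.); the sketch only shows the edge-rerouting idea.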
With this PR and a Power machine coupled with a P100 NVIDIA GPU (16 GB memory), we are able to train ResNet-50 with a mini-batch size of 800 (~4x larger than without this PR) and 3DUnet with full image sizes (192^3 images). Performance degradation is small, ranging from 10% to 30% depending on the neural network and mini-batch size.