[Feature] Allow user to specify a fraction of the GPU memory. #48172

Conversation
💊 CI failures summary and remediations (as of commit ada1960; more details on the Dr. CI page):
codecov.io: 1 failed
torch/cuda/memory.py (Outdated)

```diff
@@ -72,6 +72,33 @@ def caching_allocator_delete(mem_ptr):
     torch._C._cuda_cudaCachingAllocator_raw_delete(mem_ptr)


+def set_memory_fraction(fraction, device: Union[Device, int] = None) -> None:
+    r"""Set memory fraction for a device.
+    The fraction is used to limit allocated memory on a CUDA device.
```
The statement is incorrect: you are only limiting the memory used by the caching allocator, so running two processes with a fraction of 0.5 each is not going to be possible (though the description suggests it should be), because some memory is consumed by operators and the CUDA context.
@VitalyFedyunin This does not work for limiting across multiple processes, because each subprocess gets its own allocator and they do not share this information with each other. In most cases, though, one can still use it to split memory, provided the setting is applied inside each subprocess. So should I just change the statement, or do something more, such as handling the multiprocess situation?
@VitalyFedyunin Hi, I've just changed the statement and the name of this function to make it clearer.
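To illustrate the reviewer's point above: the CUDA context itself consumes device memory outside the caching allocator, so two processes capped at 0.5 each still will not fit. A minimal sketch (assuming a PyTorch version that provides torch.cuda.mem_get_info):

```python
import torch

torch.cuda.init()  # creating the CUDA context already reserves device memory
free, total = torch.cuda.mem_get_info(0)
overhead = total - free  # context/driver overhead, not counted by the allocator
print(f"context overhead: {overhead / 2**20:.0f} MiB")
# Two processes each capped at 0.5 * total would additionally pay this
# overhead twice, so together they would exceed physical device memory.
```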
Force-pushed from 58c87d1 to 2296fde.
Codecov Report

```
@@            Coverage Diff             @@
##           master   #48172      +/-   ##
==========================================
- Coverage   81.26%   81.25%   -0.01%
==========================================
  Files        1840     1840
  Lines      198865   198875      +10
==========================================
- Hits       161598   161588      -10
- Misses      37267    37287      +20
```
Force-pushed from 9ad2288 to ada1960.
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@VitalyFedyunin merged this pull request in 47aa253.
@VitalyFedyunin Hi, I noticed issue #58466 and tried to fix it. However, I could not find a way to solve it with only a small change. One approach I have found is to use the NVML library to get per-process info and adjust the allocator's total-memory figure accordingly. The user's maximum memory would then be limited on a GPU, but this does not work in a Docker environment, because the host process PID does not match the container PID unless the user enables Docker PID mapping.
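A minimal sketch of the NVML approach described above (assuming the pynvml package; note the Docker PID mismatch mentioned in the comment):

```python
import os
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Query per-process memory usage on the device. Inside a container without
# PID mapping, os.getpid() will not match the host PIDs reported here.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = (proc.usedGpuMemory or 0) / 2**20  # may be None on some drivers
    tag = " (this process)" if proc.pid == os.getpid() else ""
    print(f"pid {proc.pid}: {used:.0f} MiB{tag}")

pynvml.nvmlShutdown()
```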
Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: #18626
The fraction (a float from 0 to 1) is used to limit the memory of the caching allocator on a GPU device. One can set it on any visible GPU. The allowed memory equals total memory * fraction. An OOM error is raised when an application tries to allocate more GPU memory than the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
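For instance, the effective cap can be computed from the device properties (a minimal sketch; the 0.5 fraction is illustrative):

```python
import torch

fraction = 0.5
total = torch.cuda.get_device_properties(0).total_memory  # bytes
print(f"allocator cap: {fraction * total / 2**30:.2f} GiB "
      f"of {total / 2**30:.2f} GiB")
```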
Note that this setting only limits the caching allocator within one process. If you are using multiprocessing, you need to apply this setting inside each subprocess to limit its GPU memory, because each subprocess has its own allocator; see the sketch below.
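A minimal multiprocessing sketch under that assumption (the worker function and the 0.4 fractions are hypothetical):

```python
import torch
import torch.multiprocessing as mp

def worker(fraction: float) -> None:
    # Each subprocess has its own caching allocator, so the limit must be
    # set here, inside the subprocess, before any GPU allocation happens.
    torch.cuda.set_per_process_memory_fraction(fraction, 0)
    x = torch.empty(1024, 1024, device="cuda:0")  # allocations are now capped
    print(f"fraction={fraction}: allocated {x.numel() * x.element_size()} bytes")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # required for CUDA with multiprocessing
    procs = [mp.Process(target=worker, args=(f,)) for f in (0.4, 0.4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```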
Usage
In some cases, one needs to split a GPU device into two parts, and the memory limit can be set before any GPU memory is used. E.g., for device 0, with each part taking half of the memory, the code is as follows.
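A minimal sketch consistent with the description (the 0.45 and 0.55 sizes are illustrative assumptions):

```python
import torch

# Limit this process's caching allocator to half of device 0's memory.
torch.cuda.set_per_process_memory_fraction(0.5, 0)
total = torch.cuda.get_device_properties(0).total_memory

# Just under the cap: this allocation succeeds.
ok = torch.empty(int(total * 0.45), dtype=torch.int8, device="cuda:0")
del ok
torch.cuda.empty_cache()

# Over the cap: this raises a CUDA out-of-memory error, even though the
# physical device may still have free memory.
try:
    too_big = torch.empty(int(total * 0.55), dtype=torch.int8, device="cuda:0")
except RuntimeError as e:
    print("OOM as expected:", e)
```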