
[MPS] Fix memory leak in copy_from_mps_ #114197

Closed

Conversation

@malfet (Contributor) commented Nov 21, 2023

By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
The leak was introduced by #84928.
Also add a regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL

======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
    self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory

To execute this test, run the following from the base repo dir:
     python test/test_mps.py -k test_copy_cast_no_leak

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 1.102s

FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
...
```

Fixes #114096
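
For illustration, here is a minimal sketch of the kind of check the regression test performs (this is not the actual `test_mps.py` code; the tensor shape, iteration count, warm-up step, and the use of `torch.mps.driver_allocated_memory()` / `torch.mps.empty_cache()` are assumptions made for the example): change device and dtype in a single `.to()` call, which exercises `copy_from_mps_` with a cast, and assert that MPS driver memory returns to its starting level.

```python
# Illustrative sketch only -- not the actual TestMemoryLeak.test_copy_cast_no_leak
# from test/test_mps.py. The shape, iteration count and warm-up step are
# assumptions made for this example.
import unittest

import torch


@unittest.skipUnless(torch.backends.mps.is_available(), "requires an MPS device")
class CopyCastLeakSketch(unittest.TestCase):
    def test_copy_cast_no_leak(self):
        x = torch.rand(256, 256, device="mps")
        # Warm up the copy-with-cast path once so one-time allocations are not
        # counted as a leak, then snapshot the driver-allocated memory.
        _ = x.to(device="cpu", dtype=torch.float16)
        torch.mps.empty_cache()
        driver_before = torch.mps.driver_allocated_memory()
        for _ in range(100):
            # Changing device and dtype in a single .to() call goes through
            # copy_from_mps_ with a cast, the path that leaked destBuffer.
            _ = x.to(device="cpu", dtype=torch.float16)
        torch.mps.empty_cache()
        driver_after = torch.mps.driver_allocated_memory()
        self.assertEqual(
            driver_before, driver_after,
            f"Detected {driver_after - driver_before} bytes leak of GPU memory")


if __name__ == "__main__":
    unittest.main()
```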
@pytorch-bot added the ciflow/mps (Run MPS tests, subset of trunk) and release notes: mps (Release notes category) labels on Nov 21, 2023

pytorch-bot bot commented Nov 21, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114197

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ba5a4b8 with merge base e4ec554:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@malfet requested review from albanD and a team on November 21, 2023 02:19
@malfet (Contributor, Author) commented Nov 21, 2023

@pytorchbot merge -f "MPS tests are green"

@malfet added the topic: bug fixes (topic category) label on Nov 21, 2023
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet added this to the 2.1.2 milestone on Nov 22, 2023
@malfet deleted the malfet/mps-fix-copy-cast-memory-leak branch on November 30, 2023 22:18
@njhill commented Dec 22, 2023

This has milestone 2.1.2 but doesn't look like it made the 2.1.2 release?

@huydhn (Contributor) commented Dec 22, 2023

This has milestone 2.1.2 but doesn't look like it made the 2.1.2 release?

I double-checked and couldn't find this on https://github.com/orgs/pytorch/projects/40, so I think we indeed missed it for 2.1.2. Fortunately, it will be in 2.2.0, which will be available mid-January.

cc @atalman

@malfet (Contributor, Author) commented Dec 22, 2023

This has milestone 2.1.2 but doesn't look like it made the 2.1.2 release?

@atalman let's debug later how it happened. Have we run the script to validate that all milestoned issues/PRs made it into the branch?

@atalman (Contributor) commented Dec 27, 2023

@malfet I looked into this issue. The query that auto-adds items to the project was set up incorrectly; it was something like:

is:issue,pr is:open milestone:"2.1.2"

I changed it to the following for 2.2.0:

is:issue,pr milestone:"2.2.0"

We will have to make sure to run the github_analyze script before generating the final RC: https://github.com/pytorch/pytorch/blob/main/RELEASE.md#cutting-a-release-branch-preparations

@malfet (Contributor, Author) commented Dec 27, 2023

@atalman I meant: do we run https://github.com/pytorch/builder/blob/main/analytics/github_analyze.py to detect commits that are present in the milestone but missing from the branch?

It shows two items (a PR and an issue) missing from the branch/release:

% python github_analyze.py --repo-path ~/git/pytorch/pytorch --remote origin --branch release/2.1 --milestone-id 39 --missing-in-branch
len(main_commits)=4339
len(release_commits)=132
URL;Title;Status
https://github.com/pytorch/pytorch/pull/114197;[MPS] Fix memory leak in copy_from_mps_;closed
https://github.com/pytorch/pytorch/issues/111574;`illegal memory access` for `torch.sparse.mm(src, other) / deg.view(-1, 1).clamp_(min=1)`;closed

@atalman (Contributor) commented Dec 28, 2023

@malfet The github_analyze step was missed for the 2.1.2 release. Sorry about this. I added more documentation around the final RC process: #116476

pytorchmergebot pushed a commit that referenced this pull request Dec 28, 2023
…116476)

Formalize process of building and testing final rc. To avoid having missing PRs in the release, similar to this: #114197

Pull Request resolved: #116476
Approved by: https://github.com/huydhn
@atalman modified the milestones: 2.1.2, 2.2.0 on Dec 29, 2023
Labels
ciflow/mps (Run MPS tests, subset of trunk), Merged, release notes: mps (Release notes category), topic: bug fixes (topic category)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MPS] Memory leak when converting device and type at the same time using torch.Tensor.to()
6 participants