
Unexpected Results of In-Place Zeroing on Sliced Tensor with MPS Device #114692

Closed
Tripton opened this issue Nov 28, 2023 · 4 comments
Labels
high priority, module: mps (Related to Apple Metal Performance Shaders framework), triage review

Comments

Tripton commented Nov 28, 2023

🐛 Unexpected Results of In-Place Zeroing on Sliced Tensor with MPS Device

Performing in-place zeroing on a sliced PyTorch tensor located on the MPS device leads to unexpected results that differ from the behavior observed on CPU or CUDA devices. The issue is demonstrated in the following code snippet:

import torch

# For reference, we get the result on CPU
a = torch.ones(3).to('cpu')
a[1].zero_()
print(a) # -> tensor([1., 0., 1.])

# Now we do the same on MPS
b = torch.ones(3).to('mps')
b[1].zero_()
print(b) # -> tensor([1., 0., 0.], device='mps:0')

In this example, tensor a on the CPU is correctly zeroed in place: only index 1 is set to zero, giving the expected output tensor([1., 0., 1.]). Tensor b on the MPS device, however, produces unexpected results: the in-place zeroing of the sliced tensor correctly sets index 1 to 0, but unexpectedly also sets index 2 to 0, leading to the output tensor([1., 0., 0.], device='mps:0').
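
To make the failure mode concrete: the slice b[1] is a view that shares b's storage at a non-zero offset, so the view itself covers far fewer bytes than the underlying storage. A quick sketch of the relevant numbers (a sketch only; values assume float32, i.e. 4-byte elements):

import torch

b = torch.ones(3).to('mps')
view = b[1]                           # one-element view into b's storage
print(view.storage_offset())          # -> 1 (element offset into the shared storage)
print(view.nbytes)                    # -> 4 (bytes covered by the view itself)
print(b.untyped_storage().nbytes())   # -> 12 (bytes of the whole underlying storage)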

In addition to the potential bug described above, we observed that filling a sliced tensor in place with zero using torch.fill_ on the MPS device also produces unexpected results. The issue can be seen in the following code snippet:

import torch

c = torch.ones(3).to('mps')
c[1].fill_(2)
print(c)  # -> tensor([1., 2., 1.])

d = torch.ones(3).to('mps')
d[1].fill_(0)
print(d)  # -> tensor([1., 0., 0.], device='mps:0')

Interestingly, the unexpected behavior with torch.fill_ appears to be specific to the value 0 and does not seem to occur for other values.
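
A minimal consistency check along these lines (a sketch only; the helper zero_middle is just for illustration): on affected builds such as 2.1.1 it prints False, and it should print True once the bug is fixed.

import torch

def zero_middle(device):
    t = torch.ones(3, device=device)
    t[1].fill_(0)
    return t.cpu()

expected = zero_middle('cpu')
if torch.backends.mps.is_available():
    actual = zero_middle('mps')
    print(torch.equal(expected, actual))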

Note: The issue was initially noticed when we observed incorrect results for Swin Transformer V2 on MPS. Further investigation revealed that in the torchvision implementation of the shifted_window_attention function, the qkv_bias is prepared with in-place zeroing of a slice, similar to the example above.
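
For context, the affected pattern in torchvision's shifted_window_attention looks roughly like the following (paraphrased rather than verbatim torchvision code; the dim value and tensor contents are made up for illustration):

import torch

# Stand-in for the qkv bias of an attention layer (3 * dim entries: q, k, v).
dim = 4
qkv_bias = torch.ones(3 * dim, device='mps')

qkv_bias = qkv_bias.clone()
length = qkv_bias.numel() // 3
qkv_bias[length : 2 * length].zero_()  # intended: zero only the middle (k) third
print(qkv_bias)  # on affected builds, everything from index `length` onward is zero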

Versions

PyTorch version: 2.1.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.1.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.0.40.1)
CMake version: version 3.27.6
Libc version: N/A

Python version: 3.8.18 (default, Sep 11 2023, 08:17:16) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.1.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] numpydoc==1.5.0
[pip3] onnx==1.13.1
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchmetrics==0.10.2
[pip3] torchvision==0.16.1
[conda] nomkl 3.0 0
[conda] numpy 1.24.3 py38h1398885_0
[conda] numpy-base 1.24.3 py38h90707a3_0
[conda] numpydoc 1.5.0 py38hca03da5_0
[conda] pytorch 2.1.1 py3.8_0 pytorch
[conda] torchaudio 2.1.1 py38_cpu pytorch
[conda] torchmetrics 0.10.2 pypi_0 pypi
[conda] torchvision 0.16.1 py38_cpu pytorch

cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr

mikaylagawarecki added the module: mps (Related to Apple Metal Performance Shaders framework) and high priority labels on Nov 28, 2023
mikaylagawarecki added the triaged label and removed the triage review label on Nov 28, 2023
mikaylagawarecki commented Nov 28, 2023

Marking as high priority for silent correctness issues

malfet commented Nov 30, 2023

I think I know what is going on :(

// check if it's possible to use fillBuffer() to fill the Tensor's storage
if (value.toDouble() == 0.0 && fill_mps_tensor_(self, 0) == true)
return self;

malfet commented Nov 30, 2023

And here is the fix; a PR is coming:

diff --git a/aten/src/ATen/native/mps/operations/ConstantOps.mm b/aten/src/ATen/native/mps/operations/ConstantOps.mm
index 0a137aa86d7..52c74c3637e 100644
--- a/aten/src/ATen/native/mps/operations/ConstantOps.mm
+++ b/aten/src/ATen/native/mps/operations/ConstantOps.mm
@@ -79,7 +79,7 @@ static bool fill_mps_tensor_(Tensor& self, uint8_t value) {
   if (self.is_contiguous()) {
     MPSStream* stream = getCurrentMPSStream();
     auto storage_byte_offset = self.storage_offset() * self.itemsize();
-    stream->fill(mps::getMTLBufferStorage(self), 0, self.storage().nbytes(), storage_byte_offset);
+    stream->fill(mps::getMTLBufferStorage(self), value, self.nbytes(), storage_byte_offset);
     return true;
   }
   return false;
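
To spell out the arithmetic for the b[1] example above (a sketch, assuming float32 / 4-byte elements): the old code fills self.storage().nbytes() bytes starting at the slice's byte offset, which runs past both the slice and the end of the storage.

element_size = 4                         # float32
storage_nbytes = 3 * element_size        # 12 bytes: the whole 3-element storage
view_nbytes = 1 * element_size           # 4 bytes: the one-element slice b[1]
storage_byte_offset = 1 * element_size   # the slice starts 4 bytes into the storage

# Buggy call: fills storage_nbytes bytes starting at the offset,
# i.e. bytes [4, 16) -- indices 1 and 2, plus 4 bytes past the end of the storage.
# Fixed call: fills view_nbytes bytes starting at the offset,
# i.e. bytes [4, 8) -- only index 1.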

malfet added a commit that referenced this issue Nov 30, 2023
This fixes a regression introduced by #81951 that caused out-of-bounds access when a sliced tensor is filled with zeros.

Remove the bogus `TORCH_INTERNAL_ASSERT(length >= offset)`, as [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc) arguments are location and length rather than start and end offsets.

In `fill_mps_tensor_`:
- Pass the `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter always results in an out-of-bounds write when the offset within the storage is non-zero

Add a regression test.

Fixes #114692
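
For reference, a standalone check in the spirit of the regression test mentioned above could look like this (a sketch only; the actual test added in the PR may differ):

import unittest
import torch

class TestMPSFillSlice(unittest.TestCase):
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires MPS")
    def test_zero_fill_on_slice(self):
        cpu = torch.ones(3)
        cpu[1].zero_()
        mps = torch.ones(3, device='mps')
        mps[1].zero_()
        self.assertTrue(torch.equal(cpu, mps.cpu()))

if __name__ == '__main__':
    unittest.main()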
Tripton commented Nov 30, 2023

Thanks for the quick fix! It seems to resolve my problem, and the test case looks good as well. However, I noticed the removal of .storage(), which was previously added in another PR fixing a different issue. I'm not really sure why the .storage() was added in the first place, but maybe it's necessary to fix that other issue?

atalman pushed a commit to atalman/pytorch that referenced this issue Dec 1, 2023
Pull Request resolved: pytorch#114838
Approved by: https://github.com/atalman, https://github.com/kulinseth
malfet added a commit that referenced this issue Dec 1, 2023
Cherry pick of #114838 into release/2.1 branch

Co-authored-by: Nikita Shulga <nshulga@meta.com>
hyperfraise pushed a commit to hyperfraise/pytorch that referenced this issue Dec 21, 2023