MPS convolution is sometimes returning NaNs for valid inputs. #84138
While I hesitate to ask, may I ask for a moment of your time for a simple sanity check? While working on the CPU and MPS test case for repeated torch::mm() in issue #81185, #81185 (comment), I noticed yesterday that a C++ printf() of a torch::tensor residing on the Intel Mac MPS GPU gave weird numbers like 1e-34, 0., 0. instead of the "proper" data like 1, 2, 3, 4, 5. Once all the data was copied back to the CPU, the tensor data was correct. That behavior puzzled me. Therefore, on a hunch, could someone humor me and try the Python code snippet below on an Arm Mac with torch and MPS enabled? For reference, on a Linux machine with CUDA, creating the tensor on the CPU, copying to the GPU, and copying back to the CPU, you see e.g.
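(The original output block was not preserved in this copy of the thread. Illustratively, given what tensorPrint below prints and with placeholder pointer values, the three lines would look like:)

on cpu 1, ptr = 0x55d1c2f00a80, device= cpu, type= torch.FloatTensor, data= tensor([1., 2., 3., 4., 5.])
on gpu 2, ptr = 0x7f5e34000000, device= cuda:0, type= torch.cuda.FloatTensor, data= tensor([1., 2., 3., 4., 5.], device='cuda:0')
on cpu 3, ptr = 0x55d1c2f01b40, device= cpu, type= torch.FloatTensor, data= tensor([1., 2., 3., 4., 5.])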
As you can see, the "on gpu 2" line has a pointer into GPU memory and prints the correct data. Yesterday, for a moment, I wondered whether this behavior is a bit different with Mac+MPS. I am probably wrong, but can someone confirm that we get something similar to the CPU+CUDA output above on an Arm Mac with MPS, and NOT a broken line like [1.e-34, 0.111111, 0., 0., 0., 0.]?
Rationale: if we DO see something like [1.e-34, 0.111111, 0., 0., 0., 0.], then that might explain issue #81185 as well as the occasional NaNs seen here: the code checking for NaNs would be looking at different memory than expected (as the CUDA vs. MPS behavior would differ). But if that were the case, a lot of other things would go wrong as well, I think. If we see the correct output [1., 2., 3., 4., 5.] from MPS, then that is a relief, and I'll keep checking why the C++ printf() via data_ptr() gave a different result for "on GPU" vs. "on CPU". Sorry for the noise, thank you.

import torch
cpu_device = torch.device("cpu")
#gpu_device = torch.device("cuda:0")
gpu_device = torch.device("mps")

def tensorPrint(prefix_, data_):
    print(str(prefix_) + ", ptr = " + str(hex(data_.data_ptr())) + ", device= " + str(data_.device) + ", type= " + str(data_.type()) + ", data= " + str(data_))

if __name__ == '__main__':
    print("Torch version: ", torch.__version__)
    x = torch.tensor([1., 2., 3., 4., 5.], dtype=torch.float32, device=cpu_device)
    tensorPrint("on cpu 1", x)
    x2 = x.to(gpu_device)
    tensorPrint("on gpu 2", x2)
    x3 = x2.to(cpu_device)
    tensorPrint("on cpu 3", x3)
|
|
Thank you! This tells me a few things
|
TL;DR: On the same Intel iMac, the same MPS matrix multiply via the latest Torch C++ does not generate NaNs, but the Python test case on the same Torch DOES generate occasional NaNs. On an Intel iMac with Torch 1.13.0a0+gitf4f54c7 (built locally from yesterday's git) and macOS Monterey 12.5, I see that
When a NaN is found, I check the first few elements of the input and output matrices a, b, and c, and they look normal to me. PS: on the first invocation of "python3 bug.py" I see the warning
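(Neither the warning text nor bug.py itself is preserved in this copy of the thread. For orientation, a minimal sketch of what bug.py plausibly does; the shapes and iteration count are my assumptions, not the original code:)

import torch

# Hypothetical reconstruction of bug.py: repeat an MPS matrix multiply and
# report whenever the result contains a NaN.
device = torch.device("mps")

for i in range(10000):
    a = torch.randn(499, 526, device=device)  # shapes assumed from a later comment
    b = torch.randn(526, 499, device=device)
    c = torch.mm(a, b)
    if torch.isnan(c).any():
        print(f"iteration {i}: NaN in output")
        print("a[:2, :4] =", a[:2, :4])  # inspect the first few input elements
        print("b[:2, :4] =", b[:2, :4])
        print("c[:2, :4] =", c[:2, :4])
        break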
So where do we go from here? |
And the NaN is indeed also generated for the convolution case in the first post of this thread, i.e. both torch.nn.functional.conv1d and torch.mm exhibit the NaN behavior on this setup and machine. |
I'm starting to wonder whether the NaNs are related to torch.mm or torch.conv at all. The code snippet below, adapted from the earlier Python test case, generates NaNs as well, which seems to be the reason for the NaNs in the output matrix c after torch.mm: the NaN was in the input already ;) The code regularly generates NaNs on an MPS device (on an Intel iMac, 1.13.0a0+gitf4f54c7) but not on the CPU. It seems to be a randn()-specific issue, as torch.zeros(), torch.ones(), and torch.full((499, 526), 3.14, device='mps') work fine. Even better, a functionally equivalent test case in C++ generating data on MPS via torch::rand() does not generate any NaNs, but the Python snippet below does. Reminds me of another Torch MPS randn issue, #84288. I'm happy to follow up further and put breakpoints and debug code in the MPS version of randn() or the Python rand() bindings, but I could use a few pointers and filenames to get started, please :)
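(The snippet itself did not survive in this copy of the thread. A minimal sketch consistent with the description above; the shape is taken from the full() call, and the loop count is my assumption:)

import torch

# Sketch of the missing snippet: torch.randn() directly on the MPS device
# occasionally produces NaNs before any mm/conv is involved.
device = torch.device("mps")

for i in range(1000):
    a = torch.randn(499, 526, device=device)
    if torch.isnan(a).any():
        print(f"iteration {i}: NaN already present in randn() output")
        break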
|
Did you find the code? I haven't had time to track down Tensor::normal_ yet. I also found a partial MPS Generator implementation, which I think is a thin inheritance wrapper around the c10 generator (it's been a couple of decades since I wrote any C++): https://github.com/pytorch/pytorch/blob/master/c10/core/GeneratorImpl.cpp, which just calls /dev/urandom. |
Sorry, no. I was hoping for some pointers into at::mps:: and randn, like here for a similar issue. PS: Thanks for your link to the MPS Generator. |
I tried to rewrite random_mps_impl in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/mps/operations/Distributions.mm, and I can generate NaNs with this simple Swift code:

import Foundation
import MetalPerformanceShadersGraph

// Set up the default Metal device and a command queue.
let gDevice = MTLCreateSystemDefaultDevice()!
let gCommandQueue = gDevice.makeCommandQueue()!

let randomSize: Int = 5000000

// Describe a normally distributed float32 random tensor (mean 0, stddev 1).
let desc = MPSGraphRandomOpDescriptor(distribution: MPSGraphRandomDistribution.normal, dataType: MPSDataType.float32)
desc?.mean = 0.0
desc?.standardDeviation = 1.0

// Build a graph that draws randomSize samples and reduces them to a single sum.
let graph = MPSGraph()
let randomTensor = graph.randomTensor(withShape: [randomSize as NSNumber], descriptor: desc!, seed: Int.random(in: 0...256*256*256), name: nil)
let sum = graph.reductionSum(with: randomTensor, axes: [0], name: nil)

// Run the graph and read the scalar result back to the CPU.
let feeds: [MPSGraphTensor: MPSGraphTensorData] = [:]
let fetch = graph.run(with: gCommandQueue,
                      feeds: feeds,
                      targetTensors: [sum],
                      targetOperations: [])
let output = fetch[sum]!
var sumResult: Float32 = 0.0
output.mpsndarray().readBytes(&sumResult, strideBytes: nil)
print(sumResult)

The frequency of NaNs seems to be the same as with the PyTorch code torch.sum(torch.randn((5000000), device='mps')). This issue comes from MPS. |
Splitting #81185 into two. This one focuses on the MPS side of the problem with the following repro:
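(The repro block did not survive in this copy of the thread. As a placeholder, a minimal sketch consistent with the issue title and the discussion above; all shapes and the iteration count are assumptions, not the original code:)

import torch
import torch.nn.functional as F

# Sketch of the kind of repro discussed here: a conv1d on the MPS device
# that intermittently returns NaNs even though the inputs are finite.
device = torch.device("mps")

for i in range(10000):
    x = torch.randn(1, 16, 1024, device=device)  # (batch, channels, length) assumed
    w = torch.randn(32, 16, 3, device=device)    # (out_ch, in_ch, kernel) assumed
    if torch.isnan(x).any() or torch.isnan(w).any():
        print(f"iteration {i}: NaN already in the inputs")
        break
    y = F.conv1d(x, w)
    if torch.isnan(y).any():
        print(f"iteration {i}: NaN in conv1d output")
        break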
cc @kulinseth