MPS torch.where() is giving objectively incorrect results, leading to critical calculation errors #122916

aradley · 2024-03-28T19:56:17Z

🐛 Describe the bug

I think I have an example of how MPS can get completely different results from CPU. Hopefully the simplicity of this example will be clear and helpful. This may be related to a previous issue noted on this forum (#84936).

import numpy as np
import torch
mps_device = torch.device("mps")

## Create a numpy matrix with many zeros
np.random.seed(0)
Numpy_Test = np.random.random(200000000)
indices = np.random.choice(np.arange(Numpy_Test.size), replace=False,size=int(Numpy_Test.size * 0.6))
Numpy_Test[indices] = 0
Numpy_Matrix = Numpy_Test.reshape((20000,10000))

## Get the indices of non-zero values in the matrix, and convert these indices into a numpy array
indices = np.where(Numpy_Matrix != 0)
indices = np.asarray(indices)

## Use numpy, torch, or a torch.mps object to find where indices[1] == 8000
# Using np.where
np.where(indices[1] == 8000)[0]
array([   19165,    27061,    39165, ..., 79979029, 79987021, 79995171])

# Using torch.where
torch.where(torch.from_numpy(indices)[1] == 8000)[0]
tensor([   19165,    27061,    39165,  ..., 79979029, 79987021, 79995171])

# Using torch.where with an MPS object
torch.where(torch.from_numpy(indices)[1].to(mps_device) == 8000)[0]
tensor([   19165,    27061,    39165,  ..., 79979032, 79987024, 79995168], device='mps:0')

Notice how the first two examples (np.where and torch.where on CPU) give the same results, but the tensor moved to MPS gives different ones: the last three indices shown differ from the CPU values.

If I've not made an obvious mistake, this is a clear example of how MPS completely ruins calculations, because in this case, the indexes change, and all downstream calculations become meaningless.

Versions

torch version v0.2.1 and v0.2.0

cc @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr

alpoge · 2024-03-30T01:10:10Z

hey solved this :D! i'll put in a pull request w the fix v soon (sry im a bit slow, am a pure mathematician and thus never actually pr'ed before...)

basically the issue is that in the mpsGraph scatter operation the tensor that is written from, which in the case of this torch.where call (which turns out to be a torch.nonzero call) is a list of coordinates which are int32, is secretly being cast to a float32 behind the scenes. you'll notice that the outputs always have the top 24 bits correct! (and indeed casting an int to a float starts rounding it past 2^(24)!)
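The rounding described above can be reproduced on CPU without MPS at all. The sketch below is only an illustration, not the MPS code path: it takes the six CPU indices printed in the original report and round-trips them through float32 with NumPy. float32 carries a 24-bit significand, so between 2^26 and 2^27 the representable values are 8 apart, and the large indices snap to the nearest multiple of 8.

```python
import numpy as np

# Indices copied from the CPU output in the original report.
cpu_indices = np.array(
    [19165, 27061, 39165, 79979029, 79987021, 79995171], dtype=np.int32
)

# Simulate the suspected hidden cast: int32 -> float32 -> int32.
roundtrip = cpu_indices.astype(np.float32).astype(np.int32)

print(roundtrip)
# [   19165    27061    39165 79979032 79987024 79995168]
```

The small indices survive unchanged, while the last three come out as exactly the values reported for the MPS tensor.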

so all that is required is to split the coordinates tensor into two in the mpsGraph calls --- one modulo 2^(23), say, and one (integer-)divided by 2^(23), scatter those, and then add them back up
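A minimal NumPy sketch of that split-and-recombine idea follows. It is not the MPSGraph implementation from the pull request; `through_float32` and `split_roundtrip` are hypothetical helper names, and the float32 round trip only stands in for the lossy scatter. Both halves stay below 2^24 (the low part is under 2^23, the high part is at most 255 for int32 inputs), so each one passes through float32 exactly.

```python
import numpy as np

SPLIT = 1 << 23  # 2^23, the modulus suggested in the comment above


def through_float32(x):
    """Stand-in for the lossy path: cast integers to float32 and back."""
    return x.astype(np.float32).astype(np.int64)


def split_roundtrip(idx):
    """Send low and high parts through float32 separately, then recombine."""
    lo = idx % SPLIT   # always < 2^23, exactly representable in float32
    hi = idx // SPLIT  # at most 255 for non-negative int32 values
    return through_float32(hi) * SPLIT + through_float32(lo)


idx = np.array([19165, 79979029, 79987021, 79995171, 2**31 - 1], dtype=np.int64)

naive = through_float32(idx)  # large values get rounded
fixed = split_roundtrip(idx)  # all values are preserved

print(np.array_equal(naive, idx), np.array_equal(fixed, idx))
```

The arrays are kept as int64 here only so that the rounded value of 2^31 - 1 does not overflow in the naive path; the split path never needs more than int32 range.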

i should have the fix for this requested very soon!!! all credit to @Jckwind for spreading the word about this (and for getting me up to speed)!

hopefully a number of these other MPS arithmetic issues are related, we shall see...

kulinseth · 2024-04-10T23:34:06Z

Thanks @alpoge for the fix. We are looking into whether there is a more efficient way to do this that lets us use the full int32 index range.

alpoge mentioned this issue Mar 30, 2024

fixed torch.where and torch.nonzero issue #123024

Open

theo-costain-arondite mentioned this issue Apr 24, 2024

aten::nonzero calls taking a huge amount of time when using MPS backend vs CPU #124850

Open

hvaara mentioned this issue May 6, 2024

CPU and MPS indexing give different results such that MPS ruins downstream calculations #122233

Closed

skotapati mentioned this issue May 14, 2024

[MPS] Add workaround for nonzero with large/complex inputs #126188

Open
