torch.topk has trouble with inf's and nan's #16762

Closed
lopezpaz opened this issue Feb 5, 2019 · 21 comments
Labels
module: NaNs and Infs (Problems related to NaN and Inf handling in floating point) · module: sorting and selection · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

lopezpaz commented Feb 5, 2019

In master, the code:

import torch

x = torch.ones(10) / torch.zeros(10)
x[0] = torch.zeros(1) / torch.zeros(1)
x[3] = 3
print(x)
print(x.topk(1, largest=True))
print(x.topk(1, largest=False))

outputs:

tensor([nan, inf, inf, 3., inf, inf, inf, inf, inf, inf])
(tensor([inf]), tensor([9]))
(tensor([inf]), tensor([4]))

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @heitorschueroff

fmassa (Member) commented Feb 5, 2019

I think a fix similar to #15886 should be applied to topk (and maybe kthvalue as well).

fmassa added the bug label Feb 5, 2019
fmassa (Member) commented Feb 5, 2019

cc @umanwizard

umanwizard self-assigned this Feb 5, 2019
umanwizard (Contributor) commented:

It's not clear to me what the "correct" output is. Should NaN be regarded as smaller than any other element, larger than any other element, or does it depend on the direction? (For sort we consider NaNs to be larger than any element, but I think it would be a bit weird here if topk returned NaNs with priority over numbers...)
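
For reference, a minimal sketch of that sort behaviour (nan sorted after every other value, including inf), which matches the examples later in this thread:

import torch

# sort treats nan as larger than any other value, so it lands after inf
a = torch.tensor([1.0, float('nan'), float('inf'), 2.0])
print(a.sort().values)   # tensor([1., 2., inf, nan])
print(a.sort().indices)  # tensor([0, 3, 2, 1])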

gchanan (Contributor) commented Feb 5, 2019

Maybe make it an option? It would also seem weird if the output of topk didn't match the output of sort.

t-vi (Collaborator) commented Feb 15, 2019

I might look into this when I think about moving topk to ATen.

cpuhrsch removed the bug label Apr 9, 2019
ezyang added the module: operators, module: NaNs and Infs (Problems related to NaN and Inf handling in floating point), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), and high priority labels Apr 9, 2019
ezyang (Contributor) commented Apr 9, 2019

@umanwizard are you still planning to work on this?

I think it is more important for topk to be consistent with sort than it is for it to return non-NaNs when there are NaNs in the input. NaN means something was done wrong.

umanwizard (Contributor) commented:

I completely forgot about this. Thanks for the reminder. Yes I will fix it.

t-vi (Collaborator) commented Apr 9, 2019

I think the updated kthvalue has the changes, and topk would get them automatically once we move it to ATen, but that is blocked on ROCm choking on large files in #18350.

gchanan (Contributor) commented Aug 22, 2019

@VitalyFedyunin to check if this is fixed.

daniyar-niantic commented:

I'm also having issues with topk, even with the latest version available on conda. The differences in inf and nan handling between CPU and GPU are quite disruptive:

import numpy as np
import torch

def test_nan_topk():
    print('torch version:', torch.__version__)
    print('cuda version:', torch.version.cuda)
    values = torch.randint(6*4, size=[6*4], dtype=torch.float32).reshape(6, 4)
    # fill some rows (fully or partially) with nan and inf
    values[1, :] = np.nan
    values[2, :] = np.inf
    values[3, :-1] = np.nan
    values[4, :-1] = np.inf

    for values_on_device in [values.cpu(), values.cuda()]:
        k_values, k_idx = torch.topk(values_on_device, dim=1, largest=False, k=3)
        print()
        print('devices:\n', values_on_device.device, k_values.device, k_idx.device)
        print('values:\n', values_on_device)
        print('k smallest values:\n', k_values)
        print('k indices:\n', k_idx)

if __name__ == '__main__':
    test_nan_topk()

torch version: 1.2.0
cuda version: 10.0.130

devices:
 cpu cpu cpu
values:
 tensor([[18.,  2., 17.,  7.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 20.],
        [inf, inf, inf,  4.],
        [16., 15.,  4.,  5.]])
k smallest values:
 tensor([[ 2.,  7., 17.],
        [nan, nan, nan],
        [inf, inf, inf],
        [20., nan, nan],
        [ 4., inf, inf],
        [ 4.,  5., 15.]])
k indices:
 tensor([[1, 3, 2],
        [2, 3, 0],
        [2, 3, 0],
        [3, 2, 0],
        [3, 2, 0],
        [2, 3, 1]])

devices:
 cuda:0 cuda:0 cuda:0
values:
 tensor([[18.,  2., 17.,  7.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 20.],
        [inf, inf, inf,  4.],
        [16., 15.,  4.,  5.]], device='cuda:0')
k smallest values:
 tensor([[ 2.,  7., 17.],
        [ 0.,  0.,  0.],
        [inf, inf, inf],
        [ 0.,  0.,  0.],
        [ 4., inf, inf],
        [ 4.,  5., 15.]], device='cuda:0')
k indices:
 tensor([[1, 3, 2],
        [0, 0, 0],
        [2, 1, 0],
        [0, 0, 0],
        [3, 1, 0],
        [2, 3, 1]], device='cuda:0')

umanwizard (Contributor) commented:

Sorry, I am not working on this anymore.

ssnl (Collaborator) commented Oct 3, 2019

Just for tracking: this is still broken as of 1.3.0a0+5835460.

Baranowski (Contributor) commented:

The original issue is fixed on CPU as of 1.5.0a0+1a589f5:

import torch

x = torch.ones(10) / torch.zeros(10)
x[0] = torch.zeros(1) / torch.zeros(1)
x[3] = 3
print(x)
print(x.topk(1, largest=True))
print(x.topk(1, largest=False))

outputs:

tensor([nan, inf, inf, 3., inf, inf, inf, inf, inf, inf])
torch.return_types.topk(
values=tensor([nan]),
indices=tensor([0]))
torch.return_types.topk(
values=tensor([3.]),
indices=tensor([3]))

This is consistent with sort() on both CPU and CUDA, which considers inf and nan to be greater than any finite number:

In [6]: a = torch.tensor([1,2,float('nan'),3,float('inf'),4])                                                                                                                                               

In [7]: a                                                                                                                                                                                                   
Out[7]: tensor([1., 2., nan, 3., inf, 4.])

In [8]: b = a.clone().cuda()                                                                                                                                                                                

In [9]: b                                                                                                                                                                                                   
Out[9]: tensor([1., 2., nan, 3., inf, 4.], device='cuda:0')

In [10]: a.sort()                                                                                                                                                                                           
Out[10]: 
torch.return_types.sort(
values=tensor([1., 2., 3., 4., inf, nan]),
indices=tensor([0, 1, 3, 5, 4, 2]))

In [11]: b.sort()                                                                                                                                                                                           
Out[11]: 
torch.return_types.sort(
values=tensor([1., 2., 3., 4., inf, nan], device='cuda:0'),
indices=tensor([0, 1, 3, 5, 4, 2], device='cuda:0'))

This also seems to be the desired default behaviour because it's consistent with numpy, as per #15886.

The remaining two issues are:

  • topk on CUDA is not consistent with CPU (or with sort() on any device); CUDA's topk is still waiting to be migrated to ATen, as per "Migrate topk from the TH to Aten (CUDA)" #24648.
  • there is no way for users to force a different behaviour. @gchanan suggested adding an option; it's worth considering whether such an option should be added only to topk, or also to other similar operators (e.g., sort). See the sketch below for one user-side workaround.
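
As an illustration of what forcing a different behaviour could look like today without a new kernel option, here is a hypothetical user-side helper (not a PyTorch API; the name and exact semantics are made up) that ranks NaNs last by masking them before calling topk along the last dimension:

import torch

def topk_nan_last(x, k, largest=True):
    # Hypothetical helper, not part of torch: replace NaNs with the least
    # preferred value so they are only selected once non-NaN entries run out.
    fill = float('-inf') if largest else float('inf')
    masked = torch.where(torch.isnan(x), torch.full_like(x, fill), x)
    _, indices = masked.topk(k, dim=-1, largest=largest)
    # report the original (possibly NaN) values at the selected indices
    return x.gather(-1, indices), indices

x = torch.tensor([1.0, float('nan'), float('inf'), 3.0])
print(topk_nan_last(x, 2))                 # (tensor([inf, 3.]), tensor([2, 3]))
print(topk_nan_last(x, 2, largest=False))  # (tensor([1., 3.]), tensor([0, 3]))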

VitalyFedyunin removed their assignment May 5, 2020
peterbell10 (Collaborator) commented:

CUDA topk also handles nan and inf as of #35253:

In [4]: a = torch.tensor([1,2,float('nan'),3,float('inf'),4])                   

In [5]: a.topk(4)                                                               
Out[5]: 
torch.return_types.topk(
values=tensor([nan, inf, 4., 3.]),
indices=tensor([2, 4, 5, 3]))

In [6]: a.cuda().topk(4)                                                        
Out[6]: 
torch.return_types.topk(
values=tensor([nan, inf, 4., 3.], device='cuda:0'),
indices=tensor([2, 4, 5, 3], device='cuda:0'))

cpuhrsch removed the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 30, 2021
cpuhrsch (Contributor) commented:

Marking this as not triaged because it is high priority and nobody is assigned to work on it.

gchanan added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Sep 2, 2021
gchanan (Contributor) commented Sep 2, 2021

@cpuhrsch triaged is for marking priority and modules, not assigning work.

cpuhrsch (Contributor) commented Sep 2, 2021

To clarify, I thought this issue had fallen through the cracks because some follow-up work wasn't done, and I wanted to resurface it during our triage review. The last few comments suggest it might be resolved, but it's also still marked high priority, so I thought it was worth bringing up again.

ngimel (Collaborator) commented Sep 13, 2021

Comments above say that the issue is fixed; leaving this open for now to double-check that there are tests.

ezyang (Contributor) commented Sep 13, 2021

Status of daniyar-niantic's script:

torch version: 1.10.0.dev20210913
cuda version: 11.1

devices:
 cpu cpu cpu
values:
 tensor([[18.,  9., 21., 21.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 16.],
        [inf, inf, inf,  7.],
        [ 6.,  1., 15., 15.]])
k smallest values:
 tensor([[ 9., 18., 21.],
        [nan, nan, nan],
        [inf, inf, inf],
        [16., nan, nan],
        [ 7., inf, inf],
        [ 1.,  6., 15.]])
k indices:
 tensor([[1, 0, 3],
        [2, 3, 0],
        [2, 3, 0],
        [3, 2, 0],
        [3, 2, 0],
        [1, 0, 3]])

devices:
 cuda:0 cuda:0 cuda:0
values:
 tensor([[18.,  9., 21., 21.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 16.],
        [inf, inf, inf,  7.],
        [ 6.,  1., 15., 15.]], device='cuda:0')
k smallest values:
 tensor([[ 9., 18., 21.],
        [nan, nan, nan],
        [inf, inf, inf],
        [16., nan, nan],
        [ 7., inf, inf],
        [ 1.,  6., 15.]], device='cuda:0')
k indices:
 tensor([[1, 0, 2],
        [2, 1, 0],
        [2, 1, 0],
        [3, 1, 0],
        [3, 1, 0],
        [1, 0, 2]], device='cuda:0')

ezyang (Contributor) commented Sep 13, 2021

Hooray, CPU and CUDA are consistent now (as peterbell10 says); we just need to make sure this is actually tested.
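
A minimal sketch of what such a check could look like (hypothetical, not the actual test in the PyTorch suite; the CUDA half assumes a device is available):

import torch

def check_topk_nan_inf_consistency():
    # expected ordering per this thread: nan > inf > finite values
    x = torch.tensor([1.0, 2.0, float('nan'), 3.0, float('inf'), 4.0])
    cpu_vals, cpu_idx = x.topk(4)
    assert cpu_idx.tolist() == [2, 4, 5, 3]                # nan first, then inf
    assert torch.isnan(cpu_vals[0]) and torch.isinf(cpu_vals[1])
    if torch.cuda.is_available():
        gpu_vals, gpu_idx = x.cuda().topk(4)
        assert gpu_idx.cpu().tolist() == cpu_idx.tolist()  # CPU and CUDA agree
        assert torch.equal(torch.isnan(gpu_vals.cpu()), torch.isnan(cpu_vals))

check_topk_nan_inf_consistency()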

ezyang closed this as completed Sep 13, 2021