torch.topk has trouble with inf's and nan's #16762

Closed
lopezpaz opened this issue Feb 5, 2019 · 21 comments
Labels
module: NaNs and Infs (Problems related to NaN and Inf handling in floating point) · module: sorting and selection · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

lopezpaz commented Feb 5, 2019

In master, the code:

import torch

x = torch.ones(10) / torch.zeros(10)
x[0] = torch.zeros(1) / torch.zeros(1)
x[3] = 3
print(x)
print(x.topk(1, largest=True))
print(x.topk(1, largest=False))

outputs:

tensor([nan, inf, inf, 3., inf, inf, inf, inf, inf, inf])
(tensor([inf]), tensor([9]))
(tensor([inf]), tensor([4]))

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @heitorschueroff

fmassa (Member) commented Feb 5, 2019

I think a fix similar to #15886 should be applied to topk (and maybe kthvalue as well).

fmassa added the bug label Feb 5, 2019
fmassa (Member) commented Feb 5, 2019

cc @umanwizard

umanwizard self-assigned this Feb 5, 2019
umanwizard (Contributor) commented:

It's not clear to me what the "correct" output is. Should NaN be regarded as smaller than any other element, larger than any other element, or does it depend on the direction? (For sort we consider NaNs to be larger than any element, but I think it would be a bit weird here if topk returned NaNs with priority over numbers...)
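
For reference, a minimal sketch of that sort behaviour (nan sorted after every other value, including inf), which matches the examples later in this thread:

import torch

# sort treats nan as larger than any other value, so it lands after inf
a = torch.tensor([1.0, float('nan'), float('inf'), 2.0])
print(a.sort().values)   # tensor([1., 2., inf, nan])
print(a.sort().indices)  # tensor([0, 3, 2, 1])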

gchanan (Contributor) commented Feb 5, 2019

Maybe make it an option? It would also seem weird if the output of topk didn't match the output of sort.

t-vi (Collaborator) commented Feb 15, 2019

I might look into this when I think about moving topk to ATen.

cpuhrsch removed the bug label Apr 9, 2019
ezyang added the module: operators, module: NaNs and Infs (Problems related to NaN and Inf handling in floating point), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), and high priority labels Apr 9, 2019
ezyang (Contributor) commented Apr 9, 2019

@umanwizard are you still planning to work on this?

I think it is more important for topk to be consistent with sort than it is for it to return non-NaNs when there are NaNs in the input. NaN means something was done wrong.

umanwizard (Contributor) commented:

I completely forgot about this. Thanks for the reminder. Yes I will fix it.

t-vi (Collaborator) commented Apr 9, 2019

I think the updated kthvalue has the changes, and topk would get them automatically once we move it to ATen, but that is blocked on ROCm choking on large files in #18350.

gchanan (Contributor) commented Aug 22, 2019

@VitalyFedyunin to check if this is fixed.

daniyar-niantic commented:

I'm also having issues with topk, even with the latest version available on conda. The differences in inf and nan handling between CPU and GPU are quite disruptive:

import numpy as np
import torch

def test_nan_topk():
    print('torch version:', torch.__version__)
    print('cuda version:', torch.version.cuda)
    values = torch.randint(6*4, size=[6*4], dtype=torch.float32).reshape(6, 4)
    # fill some rows (fully or partially) with nan and inf
    values[1, :] = np.nan
    values[2, :] = np.inf
    values[3, :-1] = np.nan
    values[4, :-1] = np.inf

    for values_on_device in [values.cpu(), values.cuda()]:
        k_values, k_idx = torch.topk(values_on_device, dim=1, largest=False, k=3)
        print()
        print('devices:\n', values_on_device.device, k_values.device, k_idx.device)
        print('values:\n', values_on_device)
        print('k smallest values:\n', k_values)
        print('k indices:\n', k_idx)

if __name__ == '__main__':
    test_nan_topk()

torch version: 1.2.0
cuda version: 10.0.130

devices:
 cpu cpu cpu
values:
 tensor([[18.,  2., 17.,  7.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 20.],
        [inf, inf, inf,  4.],
        [16., 15.,  4.,  5.]])
k smallest values:
 tensor([[ 2.,  7., 17.],
        [nan, nan, nan],
        [inf, inf, inf],
        [20., nan, nan],
        [ 4., inf, inf],
        [ 4.,  5., 15.]])
k indices:
 tensor([[1, 3, 2],
        [2, 3, 0],
        [2, 3, 0],
        [3, 2, 0],
        [3, 2, 0],
        [2, 3, 1]])

devices:
 cuda:0 cuda:0 cuda:0
values:
 tensor([[18.,  2., 17.,  7.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 20.],
        [inf, inf, inf,  4.],
        [16., 15.,  4.,  5.]], device='cuda:0')
k smallest values:
 tensor([[ 2.,  7., 17.],
        [ 0.,  0.,  0.],
        [inf, inf, inf],
        [ 0.,  0.,  0.],
        [ 4., inf, inf],
        [ 4.,  5., 15.]], device='cuda:0')
k indices:
 tensor([[1, 3, 2],
        [0, 0, 0],
        [2, 1, 0],
        [0, 0, 0],
        [3, 1, 0],
        [2, 3, 1]], device='cuda:0')

umanwizard (Contributor) commented:

Sorry, I am not working on this anymore.

ssnl (Collaborator) commented Oct 3, 2019

Just for tracking: this is still broken as of 1.3.0a0+5835460.

Baranowski (Contributor) commented:

The original issue is fixed on CPU as of 1.5.0a0+1a589f5:

import torch

x = torch.ones(10) / torch.zeros(10)
x[0] = torch.zeros(1) / torch.zeros(1)
x[3] = 3
print(x)
print(x.topk(1, largest=True))
print(x.topk(1, largest=False))

outputs:

tensor([nan, inf, inf, 3., inf, inf, inf, inf, inf, inf])
torch.return_types.topk(
values=tensor([nan]),
indices=tensor([0]))
torch.return_types.topk(
values=tensor([3.]),
indices=tensor([3]))

This is consistent with sort() on both CPU and CUDA, which considers inf and nan to be greater than any finite number:

In [6]: a = torch.tensor([1,2,float('nan'),3,float('inf'),4])                                                                                                                                               

In [7]: a                                                                                                                                                                                                   
Out[7]: tensor([1., 2., nan, 3., inf, 4.])

In [8]: b = a.clone().cuda()                                                                                                                                                                                

In [9]: b                                                                                                                                                                                                   
Out[9]: tensor([1., 2., nan, 3., inf, 4.], device='cuda:0')

In [10]: a.sort()                                                                                                                                                                                           
Out[10]: 
torch.return_types.sort(
values=tensor([1., 2., 3., 4., inf, nan]),
indices=tensor([0, 1, 3, 5, 4, 2]))

In [11]: b.sort()                                                                                                                                                                                           
Out[11]: 
torch.return_types.sort(
values=tensor([1., 2., 3., 4., inf, nan], device='cuda:0'),
indices=tensor([0, 1, 3, 5, 4, 2], device='cuda:0'))

This also seems to be the desired default behaviour because it's consistent with numpy, as per #15886.

The remaining two issues are:

  • topk on CUDA is not consistent with CPU (or with sort() on any device); CUDA's topk is still waiting to be migrated to ATen, as per "Migrate topk from the TH to Aten (CUDA)" #24648.
  • there is no way for users to force a different behaviour. @gchanan suggested adding an option; it's worth considering whether such an option should be added only to topk, or also to other similar operators (e.g., sort). See the sketch below for one user-side workaround.
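
As an illustration of what forcing a different behaviour could look like today without a new kernel option, here is a hypothetical user-side helper (not a PyTorch API; the name and exact semantics are made up) that ranks NaNs last by masking them before calling topk along the last dimension:

import torch

def topk_nan_last(x, k, largest=True):
    # Hypothetical helper, not part of torch: replace NaNs with the least
    # preferred value so they are only selected once non-NaN entries run out.
    fill = float('-inf') if largest else float('inf')
    masked = torch.where(torch.isnan(x), torch.full_like(x, fill), x)
    _, indices = masked.topk(k, dim=-1, largest=largest)
    # report the original (possibly NaN) values at the selected indices
    return x.gather(-1, indices), indices

x = torch.tensor([1.0, float('nan'), float('inf'), 3.0])
print(topk_nan_last(x, 2))                 # (tensor([inf, 3.]), tensor([2, 3]))
print(topk_nan_last(x, 2, largest=False))  # (tensor([1., 3.]), tensor([0, 3]))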

VitalyFedyunin removed their assignment May 5, 2020
peterbell10 (Collaborator) commented:

CUDA topk also handles nan and inf as of #35253:

In [4]: a = torch.tensor([1,2,float('nan'),3,float('inf'),4])                   

In [5]: a.topk(4)                                                               
Out[5]: 
torch.return_types.topk(
values=tensor([nan, inf, 4., 3.]),
indices=tensor([2, 4, 5, 3]))

In [6]: a.cuda().topk(4)                                                        
Out[6]: 
torch.return_types.topk(
values=tensor([nan, inf, 4., 3.], device='cuda:0'),
indices=tensor([2, 4, 5, 3], device='cuda:0'))

cpuhrsch removed the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 30, 2021
cpuhrsch (Contributor) commented:

Marking this as not triaged because it is high priority and nobody is assigned to work on it.

gchanan added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Sep 2, 2021
gchanan (Contributor) commented Sep 2, 2021

@cpuhrsch triaged is for marking priority and modules, not assigning work.

cpuhrsch (Contributor) commented Sep 2, 2021

To clarify, I thought this issue had fallen through the cracks because some follow-up work wasn't done, and I wanted to resurface it during our triage review. The last few comments suggest it might be resolved, but it's also still marked high priority, so I thought it was worth bringing up again.

ngimel (Collaborator) commented Sep 13, 2021

Comments above say that the issue is fixed; leaving this open for now to double-check that there are tests.

ezyang (Contributor) commented Sep 13, 2021

Status of daniyar-niantic's script:

torch version: 1.10.0.dev20210913
cuda version: 11.1

devices:
 cpu cpu cpu
values:
 tensor([[18.,  9., 21., 21.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 16.],
        [inf, inf, inf,  7.],
        [ 6.,  1., 15., 15.]])
k smallest values:
 tensor([[ 9., 18., 21.],
        [nan, nan, nan],
        [inf, inf, inf],
        [16., nan, nan],
        [ 7., inf, inf],
        [ 1.,  6., 15.]])
k indices:
 tensor([[1, 0, 3],
        [2, 3, 0],
        [2, 3, 0],
        [3, 2, 0],
        [3, 2, 0],
        [1, 0, 3]])

devices:
 cuda:0 cuda:0 cuda:0
values:
 tensor([[18.,  9., 21., 21.],
        [nan, nan, nan, nan],
        [inf, inf, inf, inf],
        [nan, nan, nan, 16.],
        [inf, inf, inf,  7.],
        [ 6.,  1., 15., 15.]], device='cuda:0')
k smallest values:
 tensor([[ 9., 18., 21.],
        [nan, nan, nan],
        [inf, inf, inf],
        [16., nan, nan],
        [ 7., inf, inf],
        [ 1.,  6., 15.]], device='cuda:0')
k indices:
 tensor([[1, 0, 2],
        [2, 1, 0],
        [2, 1, 0],
        [3, 1, 0],
        [3, 1, 0],
        [1, 0, 2]], device='cuda:0')

ezyang (Contributor) commented Sep 13, 2021

Hooray, CPU and CUDA are consistent now (as peterbell10 says); we just need to make sure this is actually tested.
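
A minimal sketch of what such a check could look like (hypothetical, not the actual test in the PyTorch suite; the CUDA half assumes a device is available):

import torch

def check_topk_nan_inf_consistency():
    # expected ordering per this thread: nan > inf > finite values
    x = torch.tensor([1.0, 2.0, float('nan'), 3.0, float('inf'), 4.0])
    cpu_vals, cpu_idx = x.topk(4)
    assert cpu_idx.tolist() == [2, 4, 5, 3]                # nan first, then inf
    assert torch.isnan(cpu_vals[0]) and torch.isinf(cpu_vals[1])
    if torch.cuda.is_available():
        gpu_vals, gpu_idx = x.cuda().topk(4)
        assert gpu_idx.cpu().tolist() == cpu_idx.tolist()  # CPU and CUDA agree
        assert torch.equal(torch.isnan(gpu_vals.cpu()), torch.isnan(cpu_vals))

check_topk_nan_inf_consistency()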

ezyang closed this as completed Sep 13, 2021