
torch.utils.data.random_split crashes without an error message with non-CPU Generator object #44714

Closed
ProGamerGov opened this issue Sep 15, 2020 · 21 comments
Assignees: malfet
Labels: high priority, module: bootcamp, module: crash, module: random, small, triaged

Comments

@ProGamerGov
Contributor

ProGamerGov commented Sep 15, 2020

🐛 Bug

Non-CPU Generator objects cause torch.utils.data.random_split to crash without any error message

To Reproduce

Steps to reproduce the behavior:

  1. Create a Generator object with device type CUDA.
  2. Pass that CUDA Generator to the torch.utils.data.random_split function.
  3. Run the code and watch it crash without any error message.
import torch

rnd_generator = torch.Generator(device='cuda:0')

# Crashes the process (segfault) instead of raising an error
print(sorted(torch.utils.data.random_split([1,2,3,4,5,6,7,8,9,0], [8,2], generator=rnd_generator)[0]))

Expected behavior

Either the device type of the Generator object shouldn't affect torch.utils.data.random_split, or an informative error should be raised.

Environment

  • PyTorch version: 1.6.0+cu101

  • Is debug build: False

  • CUDA used to build PyTorch: 10.1

  • ROCM used to build PyTorch: N/A

  • OS: Ubuntu 18.04.5 LTS (x86_64)

  • GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

  • Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)

  • CMake version: version 3.12.0

  • Python version: 3.6 (64-bit runtime)

  • Is CUDA available: True

  • CUDA runtime version: 10.1.243

  • GPU models and configuration: GPU 0: Tesla K80

  • Nvidia driver version: 418.67

  • cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

  • HIP runtime version: N/A

  • MIOpen runtime version: N/A

Versions of relevant libraries:

  • [pip3] numpy==1.18.5
  • [pip3] torch==1.6.0+cu101
  • [pip3] torchsummary==1.5.1
  • [pip3] torchtext==0.3.1
  • [pip3] torchvision==0.7.0+cu101
  • [conda] Could not collect

Additional context

The above is from Google Colab (the instance crashed when I ran the test code), and I can confirm the issue is also present on Windows.

cc @ezyang @gchanan @zou3519 @pbelevich

@ngimel added the module: crash, module: random, triaged, and triage review labels and removed the triage review and triaged labels Sep 15, 2020
@ssnl
Collaborator

ssnl commented Sep 15, 2020

Root cause is randperm

In [5]: torch.randperm(3, generator=torch.Generator('cuda'))
[1]    36 segmentation fault  ipython

@ssnl
Collaborator

ssnl commented Sep 15, 2020

I suppose

Tensor randperm(int64_t n, c10::optional<Generator> generator, const TensorOptions& options) {
  auto tensor = at::empty(n, options);
  return at::randperm_out(tensor, n, generator);
}

should be modified to check for (and perhaps allow?) a CUDA generator before creating the tensor.
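
For illustration, a minimal Python-level sketch of the kind of device check being suggested (the real fix belongs in the C++ randperm path; the wrapper name below is hypothetical):

import torch

def randperm_checked(n, generator=None, device='cpu'):
    # Hypothetical wrapper: reject a generator whose device type does not match
    # the device the permutation will be allocated on, instead of segfaulting.
    device = torch.device(device)
    if generator is not None and generator.device.type != device.type:
        raise RuntimeError(
            f"Expected a '{device.type}' device type for generator "
            f"but found '{generator.device.type}'"
        )
    return torch.randperm(n, generator=generator, device=device)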

@ezyang added the small and module: bootcamp labels Sep 21, 2020
@ezyang
Contributor

ezyang commented Sep 21, 2020

@ssnl It would be better to do the check inside randperm_out, no?

@ssnl
Collaborator

ssnl commented Sep 21, 2020

@ezyang It depends on whether we want to allow torch.randperm(..., generator=a_cuda_gen) (specifying no device, but just the generator).

@malfet added the triaged label Sep 22, 2020
@samestep assigned malfet and unassigned samestep Sep 22, 2020
@ezyang
Contributor

ezyang commented Oct 27, 2020

@ssnl I just understood what your comment here meant. Let me try to elaborate it for the benefit of @janeyx99 .

The most basic version of this bug that needs to be fixed is that we allow you to do this: torch.randperm(3, device='cpu', generator=torch.Generator(device='cuda')). This is obviously wrong and should raise an error (as should all other variants of it).

However, there is a more subtle design consideration: torch.randperm(3, generator=torch.Generator(device='cuda')). Here, the code is not obviously wrong: we did not explicitly ask for a CPU device, and one might imagine that torch.randperm could just implicitly determine the correct device to allocate the random tensor on based on the generator. Generators are per device so this would fully specify the device. I believe this is what @ssnl is suggesting we support.

I think that it is reasonable to support this, but I think it would be more complicated to do correctly, and we should just fix the first bug first.
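
To make the device-inference variant concrete, a hedged sketch (purely illustrative, not the behavior at the time of this issue):

import torch

def randperm_infer_device(n, generator=None, device=None):
    # If no device is given, infer it from the generator: generators are
    # per-device, so the generator alone fully determines where the
    # permutation should live.
    if device is None and generator is not None:
        device = generator.device
    return torch.randperm(n, generator=generator, device=device)

The explicit-mismatch case (e.g. device='cpu' with a CUDA generator) would still need to raise, as discussed above.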

@janeyx99
Contributor

@ezyang Thank you for the clarification. When you say all other variants of the first case, do you mean whenever the generator is on a different device than the one specified, or any CUDA generator at all? I thought it should be the former, but the outputs of the code below don't seem to fit that:

torch.randperm(3, device='cuda', generator=torch.Generator(device='cuda'))
Segmentation fault (core dumped)

torch.randperm(3, device='cuda', generator=torch.Generator(device='cpu'))
tensor([0, 1, 2], device='cuda:0') --> it returned this all 4 times I ran it, which doesn't seem random.

@ezyang
Contributor

ezyang commented Oct 27, 2020

Oh ok, maybe the problem runs deeper than I thought. Some investigation sounds necessary :)

@janeyx99
Contributor

Update: After some investigation, I believe this is the reason the above happens.

if (n < 30000) { // For small inputs, we offload it to CPU instead.
  auto result_cpu = at::empty({n}, result.options().device(kCPU));
  randperm_out(result_cpu, n, generator);
  return result.copy_(result_cpu);
}

In the CUDA implementation of randperm_out, we offload computation to the CPU when n is smaller than 30000, without giving any consideration to the generator device.

To confirm this, the following code works as expected:

torch.randperm(30000, device='cuda', generator=torch.Generator(device='cuda')) # should work and does
> tensor([19784, 26331, 27863,  ..., 12151, 14326,  6622], device='cuda:0')

torch.randperm(30000, device='cuda', generator=torch.Generator(device='cpu')) # should not work and doesn't
> RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

So now the question is: do we still want to offload to the CPU for small inputs when a generator is defined?
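
One possible dispatch policy, sketched in Python for clarity (the real code is C++; the 30000 threshold comes from the snippet above, and this is just one option rather than a decided fix):

import torch

def cuda_randperm_route(n, generator=None, threshold=30000):
    # A CUDA generator can only be consumed by the GPU kernel, so never
    # offload in that case; otherwise keep the existing small-input offload.
    if generator is not None and generator.device.type == 'cuda':
        return 'gpu'
    return 'cpu' if n < threshold else 'gpu'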

@ssnl
Collaborator

ssnl commented Oct 28, 2020

Oh no... even if we do not offload to the CPU when a generator is defined, this means nondeterminism when the CUDA seed is set...

@ssnl
Collaborator

ssnl commented Oct 28, 2020

>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(3, device='cuda')
>>> x
tensor([2, 0, 1], device='cuda:0')
>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(3, device='cuda')
>>> x
tensor([0, 1, 2], device='cuda:0')
>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(30003, device='cuda')
>>> x[:5]
tensor([23025, 28065, 12737,  1352,  2876], device='cuda:0')
>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(30003, device='cuda')
>>> x[:5]
tensor([23025, 28065, 12737,  1352,  2876], device='cuda:0')

@janeyx99
Contributor

Hm, so what would be the right thing to do here? I'll submit a PR for a quick fix to the first issue, but I'm not quite sure how to handle the small-tensor-with-CUDA-generator case.

@ezyang
Contributor

ezyang commented Oct 29, 2020

This seems tricky, and related to #46148. cc @mcarilli

I guess, hypothetically, because we currently store the CUDA rng state on the CPU, we could have some way of using the CUDA state to feed the CPU generator in this case. This would be hard to do if the CUDA rng state lived entirely on the GPU, which seems like the better end state. But then I think we're just out of luck without doing a sync to get the rng state to the CPU. But maybe this is fine.
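
For what it's worth, a rough sketch of the "feed the CPU generator from CUDA state" idea, using only the coarse initial seed rather than the full Philox state (assumes a CUDA device is available; purely illustrative):

import torch

cuda_gen = torch.Generator(device='cuda')
cuda_gen.manual_seed(12)

# Derive a CPU generator from the CUDA generator's seed, then run the small
# randperm on the CPU with it. The real proposal would use the full Philox
# state (and may require a device sync), not just initial_seed().
cpu_gen = torch.Generator()
cpu_gen.manual_seed(cuda_gen.initial_seed())

perm = torch.randperm(10, generator=cpu_gen)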

@mcarilli
Collaborator

mcarilli commented Oct 30, 2020

Silent offload to CPU for N < 30000 is also annoying for me right this second, because the CPU work won't be captured by CUDA graphs...

we could have some way of using the CUDA state to feed the CPU generator in this case.

There are calls to retrieve the relevant Philox state values from CUDA generators, which could seed a CPU generator, but only if the CPU generator also uses Philox.

The GPU-state maintenance added by my PR provides the same interface to retrieve Philox state values. It syncs as needed under the hood, so usage doesn't need to change at all in the caller, but the fact that it needs to sync is annoying. It could simultaneously maintain dummy values on the CPU as well, and update them alongside the GPU-side state tensors, for sync-free retrieval of the state values... but that would break under CUDA graphs, because graph replay would elide the CPU-side maintenance of the dummy values.

What's the performance delta between CPU and GPU randperm for, say, 10000 elements? If the delta is negligible, just run on GPU all the time. Even if the delta is significant, if the runtime is negligible to begin with, it might be worth running on GPU all the time for simplicity.
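
As a back-of-the-envelope answer, a hedged timing sketch (numbers are machine-dependent; n = 30000 is used because, per the snippet above, it is the smallest size that takes the GPU path, so it approximates the delta near the threshold):

import time
import torch

n = 30000
torch.randperm(n, device='cuda')   # warm-up: CUDA context and kernels
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(100):
    torch.randperm(n, device='cpu')
cpu_ms = (time.perf_counter() - t0) * 1000 / 100

t0 = time.perf_counter()
for _ in range(100):
    torch.randperm(n, device='cuda')
torch.cuda.synchronize()
gpu_ms = (time.perf_counter() - t0) * 1000 / 100

print(f"randperm({n}): CPU {cpu_ms:.3f} ms/iter, GPU {gpu_ms:.3f} ms/iter")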

@demmerichs

demmerichs commented Apr 19, 2021

Why did this issue get closed? I also ran into it today: I specified device cuda for torch.random along with a generator placed on cuda, but it throws a segmentation fault. This seems like a bad bug and should have priority?

Are there any known workarounds for users, at least? I could not find any in this issue.
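
One possible workaround sketch for the original random_split crash: keep the generator that drives splitting/shuffling on the CPU, since only the tensors need to live on the GPU, not the sampling RNG.

import torch
from torch.utils.data import random_split

gen = torch.Generator()   # CPU generator
gen.manual_seed(42)

dataset = list(range(10))
train, val = random_split(dataset, [8, 2], generator=gen)
print(sorted(train.indices), sorted(val.indices))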

@ezyang
Contributor

ezyang commented Apr 19, 2021

@demmerichs What version of PyTorch are you running? If you don't have a mismatch between the generator device and the requested device, this is likely a different bug; please file a new issue for it.

@Kae1101

Kae1101 commented Jun 19, 2021

Hi @ezyang,
Today when I run my code (which ran successfully before) in Colab, any function involving data iteration (like iter(dataloader), dataloader.next(), for idx, data in dataloader, and so on) raises RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'. This error is related to torch.randperm. More details are shown below:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py in __iter__(self)
    122             yield from torch.randint(high=n, size=(self.num_samples % 32,), dtype=torch.int64, generator=generator).tolist()
    123         else:
--> 124             yield from torch.randperm(n, generator=generator).tolist()
    125
    126     def __len__(self) -> int:

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

It is worth noting that I never hit this error before tonight and had run the same code successfully...
I am very confused now...

@NLQVan

NLQVan commented Jun 21, 2021

@Kae1101
I got the same problem 2 days ago in Colab; my code also ran fine before. I think the problem happens when your training DataLoader has shuffle=True; if you try your test DataLoader, where shuffle is set to False, the problem won't happen.
I found a way to make my code run again, hope it will help you: add a generator argument to your DataLoader and set it like this:

train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False, generator=torch.Generator(device='cuda'))

Hope it helps :)

@Kae1101

Kae1101 commented Jun 21, 2021

@NLQVan
Thanks for your solution! But I had already fixed the problem by commenting out the following command:

torch.set_default_tensor_type(torch.cuda.FloatTensor)

Does dataiter.next() return a CPU FloatTensor by default? If it does, I think that is why the error was reported...

But I am still confused, because I ran the same code maybe more than 50 times in the week before 2021/06/19 and didn't get any errors like this.
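
A hedged sketch of the interaction being described (assumes a CUDA device and PyTorch 1.9.0): with the default tensor type set to CUDA, randperm allocates its output on the GPU, while the DataLoader's default generator is a CPU one, so the generator device check discussed earlier in this thread fires.

import torch

torch.set_default_tensor_type(torch.cuda.FloatTensor)
# The sampler's permutation is now allocated on the GPU by default, but the
# generator is a CPU one, so 1.9.0 raises:
# RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
torch.randperm(10, generator=torch.Generator())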

@NLQVan

NLQVan commented Jun 21, 2021

My code also ran fine before 19/06/2021; maybe something in the torch library changed and we didn't know. I also tried to fix my code by commenting out torch.set_default_tensor_type(torch.cuda.FloatTensor), but then my model got another error, something like "found 2 types of device: cuda:0 and cpu!". My solution above fixed that error in my code.

@ezyang
Contributor

ezyang commented Jun 21, 2021

@NLQVan @Kae1101 I'm having trouble following this conversation; could you open a new bug report for your issues? Thanks!

@R-N

R-N commented Jul 31, 2021

Thanks. I was having the same issue, and downgrading from 1.9.0 (June release, current latest) to 1.8.1 (March release) fixes this.
