
torch.utils.data.random_split crashes without an error message with non-CPU Generator object #44714

Closed
ProGamerGov opened this issue Sep 15, 2020 · 21 comments
Assignees: malfet
Labels: high priority, module: bootcamp, module: crash, module: random, small, triaged

Comments

@ProGamerGov
Contributor

ProGamerGov commented Sep 15, 2020

🐛 Bug

Non-CPU Generator objects cause torch.utils.data.random_split to crash without any error message

To Reproduce

Steps to reproduce the behavior:

  1. Create a Generator object with device type CUDA.
  2. Pass that CUDA Generator to the torch.utils.data.random_split function.
  3. Run the code and watch it crash without any error message.
import torch

rnd_generator = torch.Generator(device='cuda:0')

# Crashes the process (segfault) instead of raising an error
print(sorted(torch.utils.data.random_split([1,2,3,4,5,6,7,8,9,0], [8,2], generator=rnd_generator)[0]))

Expected behavior

Either the device type of the Generator object shouldn't affect torch.utils.data.random_split, or an informative error should be raised.

Environment

  • PyTorch version: 1.6.0+cu101

  • Is debug build: False

  • CUDA used to build PyTorch: 10.1

  • ROCM used to build PyTorch: N/A

  • OS: Ubuntu 18.04.5 LTS (x86_64)

  • GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

  • Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)

  • CMake version: version 3.12.0

  • Python version: 3.6 (64-bit runtime)

  • Is CUDA available: True

  • CUDA runtime version: 10.1.243

  • GPU models and configuration: GPU 0: Tesla K80

  • Nvidia driver version: 418.67

  • cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

  • HIP runtime version: N/A

  • MIOpen runtime version: N/A

Versions of relevant libraries:

  • [pip3] numpy==1.18.5
  • [pip3] torch==1.6.0+cu101
  • [pip3] torchsummary==1.5.1
  • [pip3] torchtext==0.3.1
  • [pip3] torchvision==0.7.0+cu101
  • [conda] Could not collect

Additional context

The above is from Google Colab (the instance crashed when I ran the test code), and I can confirm the issue is also present on Windows.

cc @ezyang @gchanan @zou3519 @pbelevich

@ngimel added the module: crash, module: random, triaged, and triage review labels and removed the triage review and triaged labels Sep 15, 2020
@ssnl
Collaborator

ssnl commented Sep 15, 2020

Root cause is randperm

In [5]: torch.randperm(3, generator=torch.Generator('cuda'))
[1]    36 segmentation fault  ipython

@ssnl
Collaborator

ssnl commented Sep 15, 2020

I suppose

Tensor randperm(int64_t n, c10::optional<Generator> generator, const TensorOptions& options) {
  auto tensor = at::empty(n, options);
  return at::randperm_out(tensor, n, generator);
}

should be modified to check for (and perhaps allow?) a CUDA generator before creating the tensor.
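
For illustration, a minimal Python-level sketch of the kind of device check being suggested (the real fix belongs in the C++ randperm path; the wrapper name below is hypothetical):

import torch

def randperm_checked(n, generator=None, device='cpu'):
    # Hypothetical wrapper: reject a generator whose device type does not match
    # the device the permutation will be allocated on, instead of segfaulting.
    device = torch.device(device)
    if generator is not None and generator.device.type != device.type:
        raise RuntimeError(
            f"Expected a '{device.type}' device type for generator "
            f"but found '{generator.device.type}'"
        )
    return torch.randperm(n, generator=generator, device=device)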

@ezyang added the small and module: bootcamp labels Sep 21, 2020
@ezyang
Contributor

ezyang commented Sep 21, 2020

@ssnl It would be better to do the check inside randperm_out, no?

@ssnl
Collaborator

ssnl commented Sep 21, 2020

@ezyang It depends on whether we want to allow torch.randperm(..., generator=a_cuda_gen) (specifying no device, but just the generator).

@malfet added the triaged label Sep 22, 2020
@samestep assigned malfet and unassigned samestep Sep 22, 2020
@ezyang
Contributor

ezyang commented Oct 27, 2020

@ssnl I just understood what your comment here meant. Let me try to elaborate it for the benefit of @janeyx99 .

The most basic version of this bug that needs to be fixed is that we allow you to do this: torch.randperm(3, device='cpu', generator=torch.Generator(device='cuda')). This is obviously wrong and should raise an error (as should all other variants of it).

However, there is a more subtle design consideration: torch.randperm(3, generator=torch.Generator(device='cuda')). Here, the code is not obviously wrong: we did not explicitly ask for a CPU device, and one might imagine that torch.randperm could just implicitly determine the correct device to allocate the random tensor on based on the generator. Generators are per device so this would fully specify the device. I believe this is what @ssnl is suggesting we support.

I think that it is reasonable to support this, but I think it would be more complicated to do correctly, and we should just fix the first bug first.
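
To make the device-inference variant concrete, a hedged sketch (purely illustrative, not the behavior at the time of this issue):

import torch

def randperm_infer_device(n, generator=None, device=None):
    # If no device is given, infer it from the generator: generators are
    # per-device, so the generator alone fully determines where the
    # permutation should live.
    if device is None and generator is not None:
        device = generator.device
    return torch.randperm(n, generator=generator, device=device)

The explicit-mismatch case (e.g. device='cpu' with a CUDA generator) would still need to raise, as discussed above.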

@janeyx99
Contributor

@ezyang Thank you for the clarification. When you say all other variants of the first case, do you mean whenever the generator is on a different device than the one specified, or any CUDA generator at all? I thought it should be the former, but the outputs of the code below don't seem to fit that:

torch.randperm(3, device='cuda', generator=torch.Generator(device='cuda'))
Segmentation fault (core dumped)

torch.randperm(3, device='cuda', generator=torch.Generator(device='cpu'))
tensor([0, 1, 2], device='cuda:0') --> it returned this all 4 times I ran it, which doesn't seem random.

@ezyang
Contributor

ezyang commented Oct 27, 2020

Oh ok, maybe the problem runs deeper than I thought. Some investigation sounds necessary :)

@janeyx99
Contributor

Update: After some investigation, I believe this is the reason the above happens.

if (n < 30000) { // For small inputs, we offload it to CPU instead.
  auto result_cpu = at::empty({n}, result.options().device(kCPU));
  randperm_out(result_cpu, n, generator);
  return result.copy_(result_cpu);
}

In the CUDA implementation of randperm_out, we offload computation to the CPU when n is smaller than 30000, without giving any consideration to the generator device.

To confirm this, the following code works as expected:

torch.randperm(30000, device='cuda', generator=torch.Generator(device='cuda')) # should work and does
> tensor([19784, 26331, 27863,  ..., 12151, 14326,  6622], device='cuda:0')

torch.randperm(30000, device='cuda', generator=torch.Generator(device='cpu')) # should not work and doesn't
> RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

So now the question is: do we still want to offload to the CPU for small inputs when a generator is defined?
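
One possible dispatch policy, sketched in Python for clarity (the real code is C++; the 30000 threshold comes from the snippet above, and this is just one option rather than a decided fix):

import torch

def cuda_randperm_route(n, generator=None, threshold=30000):
    # A CUDA generator can only be consumed by the GPU kernel, so never
    # offload in that case; otherwise keep the existing small-input offload.
    if generator is not None and generator.device.type == 'cuda':
        return 'gpu'
    return 'cpu' if n < threshold else 'gpu'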

@ssnl
Collaborator

ssnl commented Oct 28, 2020

Oh no... even if we do not offload to the CPU when a generator is defined, this means nondeterminism when the CUDA seed is set...

@ssnl
Collaborator

ssnl commented Oct 28, 2020

>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(3, device='cuda')
>>> x
tensor([2, 0, 1], device='cuda:0')
>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(3, device='cuda')
>>> x
tensor([0, 1, 2], device='cuda:0')
>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(30003, device='cuda')
>>> x[:5]
tensor([23025, 28065, 12737,  1352,  2876], device='cuda:0')
>>> torch.cuda.manual_seed_all(12)
>>> x = torch.randperm(30003, device='cuda')
>>> x[:5]
tensor([23025, 28065, 12737,  1352,  2876], device='cuda:0')

@janeyx99
Contributor

Hm, so what would be the right thing to do here? I'll submit a PR for a quick fix to the first issue, but I'm not quite sure how to handle the small-tensor-with-CUDA-generator case.

@ezyang
Contributor

ezyang commented Oct 29, 2020

This seems tricky, and related to #46148. cc @mcarilli

I guess, hypothetically, because we currently store the CUDA rng state on the CPU, we could have some way of using the CUDA state to feed the CPU generator in this case. This would be hard to do if the CUDA rng state lived entirely on the GPU, which seems like the better end state. But then I think we're just out of luck without doing a sync to get the rng state to the CPU. But maybe this is fine.
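
For what it's worth, a rough sketch of the "feed the CPU generator from CUDA state" idea, using only the coarse initial seed rather than the full Philox state (assumes a CUDA device is available; purely illustrative):

import torch

cuda_gen = torch.Generator(device='cuda')
cuda_gen.manual_seed(12)

# Derive a CPU generator from the CUDA generator's seed, then run the small
# randperm on the CPU with it. The real proposal would use the full Philox
# state (and may require a device sync), not just initial_seed().
cpu_gen = torch.Generator()
cpu_gen.manual_seed(cuda_gen.initial_seed())

perm = torch.randperm(10, generator=cpu_gen)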

@mcarilli
Collaborator

mcarilli commented Oct 30, 2020

Silent offload to CPU for N < 30000 is also annoying for me right this second, because the CPU work won't be captured by CUDA graphs...

we could have some way of using the CUDA state to feed the CPU generator in this case.

There are calls to retrieve the relevant Philox state values from CUDA generators, which could seed a CPU generator, but only if the CPU generator also uses Philox.

The GPU-state maintenance added by my PR provides the same interface to retrieve Philox state values. It syncs as needed under the hood, so usage doesn't need to change at all in the caller, but the fact that it needs to sync is annoying. It could simultaneously maintain dummy values on the CPU as well, and update them alongside the GPU-side state tensors, for sync-free retrieval of the state values... but that would break under CUDA graphs, because graph replay would elide the CPU-side maintenance of the dummy values.

What's the performance delta between CPU and GPU randperm for, say, 10000 elements? If the delta is negligible, just run on GPU all the time. Even if the delta is significant, if the runtime is negligible to begin with, it might be worth running on GPU all the time for simplicity.
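
As a back-of-the-envelope answer, a hedged timing sketch (numbers are machine-dependent; n = 30000 is used because, per the snippet above, it is the smallest size that takes the GPU path, so it approximates the delta near the threshold):

import time
import torch

n = 30000
torch.randperm(n, device='cuda')   # warm-up: CUDA context and kernels
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(100):
    torch.randperm(n, device='cpu')
cpu_ms = (time.perf_counter() - t0) * 1000 / 100

t0 = time.perf_counter()
for _ in range(100):
    torch.randperm(n, device='cuda')
torch.cuda.synchronize()
gpu_ms = (time.perf_counter() - t0) * 1000 / 100

print(f"randperm({n}): CPU {cpu_ms:.3f} ms/iter, GPU {gpu_ms:.3f} ms/iter")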

@demmerichs

demmerichs commented Apr 19, 2021

Why did this issue get closed? I also ran into it today: I specified device cuda for torch.random along with a generator placed on cuda, but it throws a segmentation fault. This seems like a bad bug and should have priority?

Are there any known workarounds for users, at least? I could not find any in this issue.
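
One possible workaround sketch for the original random_split crash: keep the generator that drives splitting/shuffling on the CPU, since only the tensors need to live on the GPU, not the sampling RNG.

import torch
from torch.utils.data import random_split

gen = torch.Generator()   # CPU generator
gen.manual_seed(42)

dataset = list(range(10))
train, val = random_split(dataset, [8, 2], generator=gen)
print(sorted(train.indices), sorted(val.indices))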

@ezyang
Contributor

ezyang commented Apr 19, 2021

@demmerichs What version of PyTorch are you running? If you don't have a mismatch between the generator device and the requested device, this is likely a different bug; please file a new issue for it.

@Kae1101

Kae1101 commented Jun 19, 2021

Hi @ezyang,
Today when I run my code (which ran successfully before) in Colab, any function involving data iteration (like iter(dataloader), dataloader.next(), for idx, data in dataloader, and so on) raises RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'. This error is related to torch.randperm. More details are shown below:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py in __iter__(self)
    122             yield from torch.randint(high=n, size=(self.num_samples % 32,), dtype=torch.int64, generator=generator).tolist()
    123         else:
--> 124             yield from torch.randperm(n, generator=generator).tolist()
    125
    126     def __len__(self) -> int:

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

It is worth noting that I never hit this error before tonight and had run the same code successfully...
I am very confused now...

@NLQVan

NLQVan commented Jun 21, 2021

@Kae1101
I got the same problem 2 days ago in Colab; my code also ran fine before. I think the problem happens when your training DataLoader has shuffle=True; if you try your test DataLoader, where shuffle is set to False, the problem won't happen.
I found a way to make my code run again, hope it will help you: add a generator argument to your DataLoader and set it like this:

train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False, generator=torch.Generator(device='cuda'))

Hope it helps :)

@Kae1101

Kae1101 commented Jun 21, 2021

@NLQVan
Thanks for your solution! But I had already fixed the problem by commenting out the following command:

torch.set_default_tensor_type(torch.cuda.FloatTensor)

Does dataiter.next() return a CPU FloatTensor by default? If it does, I think that is why the error was reported...

But I am still confused, because I ran the same code maybe more than 50 times in the week before 2021/06/19 and didn't get any errors like this.
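
A hedged sketch of the interaction being described (assumes a CUDA device and PyTorch 1.9.0): with the default tensor type set to CUDA, randperm allocates its output on the GPU, while the DataLoader's default generator is a CPU one, so the generator device check discussed earlier in this thread fires.

import torch

torch.set_default_tensor_type(torch.cuda.FloatTensor)
# The sampler's permutation is now allocated on the GPU by default, but the
# generator is a CPU one, so 1.9.0 raises:
# RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
torch.randperm(10, generator=torch.Generator())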

@NLQVan

NLQVan commented Jun 21, 2021

My code also ran fine before 19/06/2021; maybe something in the torch library changed and we didn't know. I also tried to fix my code by commenting out torch.set_default_tensor_type(torch.cuda.FloatTensor), but then my model got another error, something like "found 2 types of device: cuda:0 and cpu!". My solution above fixed that error in my code.

@ezyang
Contributor

ezyang commented Jun 21, 2021

@NLQVan @Kae1101 I'm having trouble following this conversation; could you open a new bug report for your issues? Thanks!

@R-N

R-N commented Jul 31, 2021

Thanks. I was having the same issue, and downgrading from 1.9.0 (June release, current latest) to 1.8.1 (March release) fixes this.
