cuda issue #8

Closed

brad0taylor opened this issue Jan 19, 2017 · 7 comments

Comments

@brad0taylor

When I run this code:

```python
import torch

a = torch.Tensor(5, 3)  # construct a 5x3 matrix, uninitialized
b = torch.Tensor(5, 3)  # construct a 5x3 matrix, uninitialized
if torch.cuda.is_available():
    aa = a.cuda()
    bb = b.cuda()
    aa + bb
```

I get the following error message:

```
RuntimeError: cuda runtime error (8) : invalid device function at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.6_1484802121799/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:246
```

@soumith
Member

soumith commented Jan 19, 2017

hey Brad. What GPU are you using?

@brad0taylor
Author

brad0taylor commented Jan 19, 2017

Hi - I'm using a GeForce GTX 950M with CUDA 8.0. The GPU install works OK with TensorFlow and Theano, and I also have cuDNN installed. I'm fairly new to this, so if you can point me to any diagnostics, I'd be happy to run them. Also, the error message comes from the last line of code.
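
For anyone hitting the same error: "invalid device function" usually means the installed binary does not include kernels for the GPU's compute capability, so a useful first diagnostic is to print what the GPU reports. A minimal sketch, assuming a PyTorch version that exposes these helpers (the 0.1.6 binaries in question may not):

```python
import torch

# Print each visible GPU and its compute capability; "invalid device function"
# points at a mismatch between this value and the archs the binary was built for.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {name}, compute capability {major}.{minor}")
else:
    print("CUDA is not available")
```

For reference, the GeForce GTX 950M is a Maxwell part with compute capability 5.0, which is exactly the architecture discussed below.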

@soumith
Member

soumith commented Jan 19, 2017

@ngimel I compile the binaries with 5.2+PTX; in that case it should work on 5.0 via JITting, right? Or is PTX only generated forward (i.e. does 5.2+PTX only cover 5.2+ archs)?

@ngimel

ngimel commented Jan 19, 2017

JITting works only forward, so you should probably add compilation for 5.0.
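
To make the forward-only rule concrete: a binary shipped with "5.2+PTX" contains sm_52 machine code plus compute_52 PTX, and the driver can JIT that PTX only for devices of capability 5.2 or higher, so a 5.0 device such as the GTX 950M is left uncovered. A rough compatibility check along these lines, assuming a recent PyTorch that exposes torch.cuda.get_arch_list() (releases of the 0.1.x era did not):

```python
import torch

major, minor = torch.cuda.get_device_capability(0)
device_cc = major * 10 + minor       # e.g. 50 for a GTX 950M
archs = torch.cuda.get_arch_list()   # e.g. ['sm_52', 'compute_52']

# sm_XY entries are precompiled kernels; an exact match is a conservative check
# (same-major, later-minor devices can generally run them too).
sass_ok = f"sm_{device_cc}" in archs

# compute_XY entries are PTX: the driver can JIT them, but only forward, i.e.
# for devices with capability >= XY.
ptx_ok = any(a.startswith("compute_") and int(a.split("_")[1]) <= device_cc
             for a in archs)

print("binary usable on this GPU:", sass_ok or ptx_ok)
```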

@soumith
Member

soumith commented Jan 19, 2017

Oh OK, good to know. I'll add 5.0 to the list as well.

@brad0taylor it's a screw-up on my end. I'll fix it by issuing new binaries by tomorrow, so you can reinstall and it'll work. If you want to try things right away, please install from source: https://github.com/pytorch/pytorch/blob/master/README.md#from-source

@brad0taylor
Author

brad0taylor commented Jan 19, 2017

@soumith thanks very much. I'll try again tomorrow.

@soumith
Member

soumith commented Jan 23, 2017

This has been fixed.

@soumith soumith closed this as completed Jan 23, 2017
malfet added a commit that referenced this issue Jun 16, 2023
Should prevent crashes during NCCL initialization.

If `data_parallel_tutorial.py` is executed without this option, it would segfault in `ncclShmOpen` while executing `nn.DataParallel(model)`.

For posterity:
```
% nvidia-smi 
Fri Jun 16 20:46:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py 
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 :    0   1   2   3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 :    0   1   2   3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
```
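
For context, the call path described in the commit note above can be reproduced with a much smaller script than the tutorial; a minimal sketch (the model here is made up for illustration; only the `nn.DataParallel` wrapping and the NCCL_DEBUG setting come from the report):

```python
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")  # same debug output as in the log above

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(5, 2)

if torch.cuda.device_count() > 1:
    # Wrapping the model in DataParallel is the step where the reported
    # segfault in ncclShmOpen occurred.
    model = nn.DataParallel(model)

model.to(device)
```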