SIGSEGV when passing numpy.int64 to torch functions (reshape, randn) #11012

Closed
nivha opened this issue Aug 29, 2018 · 3 comments

nivha commented Aug 29, 2018

Issue description

When running on GPU (this doesn't happen on CPU), I pass a numpy.int64 to torch functions such as:

# x is a torch.Tensor of size (N, D), made contiguous just to be sure
N = x.shape[0]
D = numpy.prod(x.shape[1:])  # this outputs a numpy.int64 object
x.reshape(N, D)
# or
torch.randn(N, D)

These calls are inside the forward function of a class that inherits from torch.autograd.Function, and they fail with Segmentation fault (core dumped).

(To be clear: I'm running with os.environ['CUDA_VISIBLE_DEVICES'] = '3', so the script only uses 1 of the 4 GPUs on the server.)
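
For readers reproducing this setup, here is a minimal sketch of restricting a process to one GPU. It assumes the variable is set before CUDA is initialized (safest: before importing torch); the post does not show where in the script it is set.

```python
import os

# Assumption: this runs before anything initializes CUDA. Once CUDA is
# initialized, changing the variable no longer affects the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# Count how many physical GPUs the process is allowed to see.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # 1
```

Inside the process the single visible GPU is then addressed as device 0, regardless of its physical index on the server.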

Unfortunately, I couldn't create a minimal example. This happened on several occasions, each time only after running for several iterations, and it also depends on the initialization (e.g. the random seed).
However, converting D to int solved the problem.
I did manage to catch the crash inside gdb; here's the trace:

Stack trace

code->197% gdb --args /usr/username/python3/bin/python script.py
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/username/python3.5b/bin/python3.5...done.
(gdb) run
Starting program: /usr/username/python3/bin/python script.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Missing separate debuginfo for /usr/username/python3/lib/python3.5/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
[New Thread 0x7fffecdf7700 (LWP 27992)]
[New Thread 0x7fffec5f6700 (LWP 27993)]
[New Thread 0x7fffe9df5700 (LWP 27994)]
[New Thread 0x7fffe75f4700 (LWP 27995)]
[New Thread 0x7fffe4df3700 (LWP 27996)]
[New Thread 0x7fffe25f2700 (LWP 27997)]
[New Thread 0x7fffdfdf1700 (LWP 27998)]
[New Thread 0x7fffdd5f0700 (LWP 27999)]
[New Thread 0x7fffdadef700 (LWP 28000)]
[New Thread 0x7fffd85ee700 (LWP 28001)]
[New Thread 0x7fffd5ded700 (LWP 28002)]
[New Thread 0x7fffd35ec700 (LWP 28003)]
[New Thread 0x7fffd0deb700 (LWP 28004)]
[New Thread 0x7fffce5ea700 (LWP 28005)]
[New Thread 0x7fffcbde9700 (LWP 28006)]
[New Thread 0x7fffc95e8700 (LWP 28007)]
[New Thread 0x7fffc6de7700 (LWP 28008)]
[New Thread 0x7fffc45e6700 (LWP 28009)]
[New Thread 0x7fffc1de5700 (LWP 28010)]
[New Thread 0x7fffbf5e4700 (LWP 28011)]
[New Thread 0x7fffbcde3700 (LWP 28012)]
[New Thread 0x7fffba5e2700 (LWP 28013)]
[New Thread 0x7fffb7de1700 (LWP 28014)]
[New Thread 0x7fffb55e0700 (LWP 28015)]
[New Thread 0x7fffb2ddf700 (LWP 28016)]
[New Thread 0x7fffb05de700 (LWP 28017)]
[New Thread 0x7fffadddd700 (LWP 28018)]
[New Thread 0x7fffab5dc700 (LWP 28019)]
[New Thread 0x7fffa8ddb700 (LWP 28020)]
[New Thread 0x7fffa65da700 (LWP 28021)]
[New Thread 0x7fffa3dd9700 (LWP 28022)]
[Thread 0x7fffe4df3700 (LWP 27996) exited]
[Thread 0x7fffbf5e4700 (LWP 28011) exited]
[Thread 0x7fffd5ded700 (LWP 28002) exited]
[Thread 0x7fffa65da700 (LWP 28021) exited]
[Thread 0x7fffb7de1700 (LWP 28014) exited]
[Thread 0x7fffa3dd9700 (LWP 28022) exited]
[Thread 0x7fffc6de7700 (LWP 28008) exited]
[Thread 0x7fffab5dc700 (LWP 28019) exited]
[Thread 0x7fffd35ec700 (LWP 28003) exited]
[Thread 0x7fffadddd700 (LWP 28018) exited]
[Thread 0x7fffe25f2700 (LWP 27997) exited]
[Thread 0x7fffa8ddb700 (LWP 28020) exited]
[Thread 0x7fffc1de5700 (LWP 28010) exited]
[Thread 0x7fffb05de700 (LWP 28017) exited]
[Thread 0x7fffecdf7700 (LWP 27992) exited]
[Thread 0x7fffb2ddf700 (LWP 28016) exited]
[Thread 0x7fffb55e0700 (LWP 28015) exited]
[Thread 0x7fffe75f4700 (LWP 27995) exited]
[Thread 0x7fffec5f6700 (LWP 27993) exited]
[Thread 0x7fffc45e6700 (LWP 28009) exited]
[Thread 0x7fffd0deb700 (LWP 28004) exited]
[Thread 0x7fffc95e8700 (LWP 28007) exited]
[Thread 0x7fffdd5f0700 (LWP 27999) exited]
[Thread 0x7fffcbde9700 (LWP 28006) exited]
[Thread 0x7fffdfdf1700 (LWP 27998) exited]
[Thread 0x7fffba5e2700 (LWP 28013) exited]
[Thread 0x7fffdadef700 (LWP 28000) exited]
[Thread 0x7fffe9df5700 (LWP 27994) exited]
[Thread 0x7fffbcde3700 (LWP 28012) exited]
[Thread 0x7fffd85ee700 (LWP 28001) exited]
[Thread 0x7fffce5ea700 (LWP 28005) exited]
Detaching after fork from child process 28026.
Missing separate debuginfo for /usr/username/python3/lib/python3.5/site-packages/torch/lib/libgomp-7bcb08ae.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/3e/c8c6019d7a9fb4082cae6b259de10dcd41120a.debug
Missing separate debuginfo for /usr/username/python3/lib/python3.5/site-packages/matplotlib/.libs/./libz-a147dcb0.so.1.2.3
[New Thread 0x7fffa3dd9700 (LWP 28054)]
Detaching after fork from child process 28065.
[New Thread 0x7fffa65da700 (LWP 28074)]
[New Thread 0x7fffa8ddb700 (LWP 28075)]
[New Thread 0x7fffab5dc700 (LWP 28105)]
[New Thread 0x7fff606e1700 (LWP 28106)]

Program received signal SIGSEGV, Segmentation fault.
PyLong_AsLongLongAndOverflow (vv=0x7fffb4e04030, overflow=0x7fffffffc12c) at Objects/longobject.c:1361
1361 Objects/longobject.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 expat-2.1.0-8.el7.x86_64 fontconfig-2.10.95-7.el7.x86_64 freetype-2.4.11-11.el7.x86_64 glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libX11-1.6.3-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXft-2.3.2-2.el7.x86_64 libXrender-0.9.8-2.1.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 libuuid-2.23.2-26.el7.x86_64 libxcb-1.11-4.el7.x86_64 openssl-libs-1.0.1e-42.el7.9.x86_64 pcre-8.32-15.el7.x86_64 tcl-8.5.13-8.el7.x86_64 tk-8.5.13-6.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) where
#0 PyLong_AsLongLongAndOverflow (vv=0x7fffb4e04030, overflow=0x7fffffffc12c) at Objects/longobject.c:1361
#1 0x00007fff944bfe00 in THPUtils_unpackLong (obj=) at /pytorch/torch/csrc/utils/python_numbers.h:50
#2 0x00007fff94529dd1 in THPUtils_unpackIndex (obj=) at /pytorch/torch/csrc/utils/python_numbers.h:83
#3 torch::PythonArgs::intlistWithDefault (this=0x7fffffffc2b0, i=, default_intlist=...) at /pytorch/torch/csrc/utils/python_arg_parser.h:289
#4 0x00007fff94c12062 in intlist (i=0, this=0x7fffffffc2b0) at /pytorch/torch/csrc/utils/python_arg_parser.h:264
#5 torch::autograd::THPVariable_reshape (self=0x7fffb4f6d948, args=, kwargs=) at torch/csrc/autograd/generated/python_variable_methods.cpp:4039
#6 0x00007ffff79a6961 in PyCFunction_Call (func=0x7fffb4f4de10, args=0x7fffb4f4eb08, kwds=) at Objects/methodobject.c:98
#7 0x00007ffff7a2df15 in call_function (oparg=, pp_stack=0x7fffffffc4a8) at Python/ceval.c:4705
#8 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3236
#9 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=4, kws=0x0,
kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#10 0x00007ffff7a2f018 in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=,
kws=, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#11 0x00007ffff7984882 in function_call (func=0x7fffb76262f0, arg=0x7ffff7eea638, kw=0x0) at Objects/funcobject.c:627
#12 0x00007ffff7951576 in PyObject_Call (func=0x7fffb76262f0, arg=, kw=) at Objects/abstract.c:2165
#13 0x00007ffff7a234f3 in PyEval_CallObjectWithKeywords (func=0x7fffb76262f0, arg=0x7ffff7eea638, kw=) at Python/ceval.c:4580
#14 0x00007fff948966af in THPFunction_apply (cls=0x25cec98, inputs=0x7fffb4f4db40) at torch/csrc/autograd/python_function.cpp:745
#15 0x00007ffff79a6929 in PyCFunction_Call (func=0x7fffb75fcc60, args=0x7fffb4f4db40, kwds=) at Objects/methodobject.c:109
#16 0x00007ffff7a2df15 in call_function (oparg=, pp_stack=0x7fffffffca08) at Python/ceval.c:4705
#17 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3236
#18 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=2,
kws=0x7ffff7f9b060, kwcount=0, defs=0x7fffb7624db8, defcount=1, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#19 0x00007ffff7a2f018 in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=,
kws=, kwcount=0, defs=0x7fffb7624db8, defcount=1, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#20 0x00007ffff79849a1 in function_call (func=0x7fffb762b6a8, arg=0x7fff9b1a6d48, kw=0x7fffb4f59dc8) at Objects/funcobject.c:627
#21 0x00007ffff7951576 in PyObject_Call (func=0x7fffb762b6a8, arg=, kw=) at Objects/abstract.c:2165
#22 0x00007ffff7a2b574 in ext_do_call (nk=-1692766904, na=0, flags=, pp_stack=0x7fffffffcd58, func=0x7fffb762b6a8) at Python/ceval.c:5034
#23 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3275
#24 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=2,
kws=0x7ffebab34938, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fffb765cf08, name=0x7fffe98a91b8, qualname=0x7fffe70af6f0) at Python/ceval.c:4018
#25 0x00007ffff7a2e135 in fast_function (nk=, na=2, n=, pp_stack=0x7fffffffcf78, func=0x7fffb762b730) at Python/ceval.c:4813
#26 call_function (oparg=, pp_stack=0x7fffffffcf78) at Python/ceval.c:4730
#27 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3236
#28 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=3,
kws=0x7fff9b21be20, kwcount=1, defs=0x7fffb765c5a0, defcount=2, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#29 0x00007ffff7a2f018 in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=,
kws=, kwcount=1, defs=0x7fffb765c5a0, defcount=2, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#30 0x00007ffff79849a1 in function_call (func=0x7fffb4f58620, arg=0x7fffb4f62318, kw=0x7fffb4f59688) at Objects/funcobject.c:627
#31 0x00007ffff7951576 in PyObject_Call (func=0x7fffb4f58620, arg=, kw=) at Objects/abstract.c:2165
#32 0x00007ffff7a2b574 in ext_do_call (nk=-1258937576, na=0, flags=, pp_stack=0x7fffffffd2c8, func=0x7fffb4f58620) at Python/ceval.c:5034
#33 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3275
#34 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=3,
kws=0x7fffb4f631f8, kwcount=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fffb4f4ea88, name=0x7fffb762d2f0, qualname=0x7fffb7617078) at Python/ceval.c:4018
#35 0x00007ffff7a2e135 in fast_function (nk=, na=3, n=, pp_stack=0x7fffffffd4e8, func=0x7fffb4f586a8) at Python/ceval.c:4813
#36 call_function (oparg=, pp_stack=0x7fffffffd4e8) at Python/ceval.c:4730
#37 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3236
#38 0x00007ffff7a2e4a6 in fast_function (nk=, na=1, n=, pp_stack=0x7fffffffd668, func=0x7fffb4f58730) at Python/ceval.c:4803
#39 call_function (oparg=, pp_stack=0x7fffffffd668) at Python/ceval.c:4730
#40 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3236
#41 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=1,
kws=0x7ffff7f9b060, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#42 0x00007ffff7a2f018 in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=,
kws=, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#43 0x00007ffff79849a1 in function_call (func=0x7fffb4f58840, arg=0x7fff9b5212e8, kw=0x7fffb4f59648) at Objects/funcobject.c:627
#44 0x00007ffff7951576 in PyObject_Call (func=0x7fffb4f58840, arg=, kw=) at Objects/abstract.c:2165
#45 0x00007ffff7a2b574 in ext_do_call (nk=-1689120024, na=0, flags=, pp_stack=0x7fffffffd9b8, func=0x7fffb4f58840) at Python/ceval.c:5034
#46 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3275
#47 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=1, kws=0x700868,
kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fffb4f4eac8, name=0x7ffff017d618, qualname=0x7fffb762d4f0) at Python/ceval.c:4018
#48 0x00007ffff7a2e135 in fast_function (nk=, na=1, n=, pp_stack=0x7fffffffdbd8, func=0x7fffb4f588c8) at Python/ceval.c:4813
#49 call_function (oparg=, pp_stack=0x7fffffffdbd8) at Python/ceval.c:4730
#50 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:3236
#51 0x00007ffff7a2ee89 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=0, kws=0x0,
kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#52 0x00007ffff7a2f018 in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=,
kws=, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#53 0x00007ffff7a2f05b in PyEval_EvalCode (co=, globals=, locals=) at Python/ceval.c:777
#54 0x00007ffff7a54240 in run_mod (arena=0x697d80, flags=0x7fffffffdf20, locals=0x7ffff7f4a388, globals=0x7ffff7f4a388, filename=0x7ffff0150f38, mod=0x702088)
at Python/pythonrun.c:976
#55 PyRun_FileExFlags (fp=0x697b40, filename_str=, start=, globals=0x7ffff7f4a388, locals=0x7ffff7f4a388, closeit=,
flags=0x7fffffffdf20) at Python/pythonrun.c:929
#56 0x00007ffff7a55843 in PyRun_SimpleFileExFlags (fp=0x697b40, filename=, closeit=1, flags=0x7fffffffdf20) at Python/pythonrun.c:396
#57 0x00007ffff7a70acd in run_file (p_cf=0x7fffffffdf20, filename=0x603100 L"../runs/mnist_two_layers_1000_1000/scripts/adam_1e-3.py", fp=0x697b40) at Modules/main.c:318
#58 Py_Main (argc=, argv=) at Modules/main.c:768
#59 0x0000000000400add in main (argc=2, argv=0x7fffffffe098) at ./Programs/python.c:65
(gdb)
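
Frames #1 to #5 of the trace show the crash happening while the size arguments of reshape are unpacked (THPUtils_unpackLong, reached from THPVariable_reshape). The workaround described above, converting D to int, can be sketched as follows. The helper name flat_dims is hypothetical, and a plain tuple stands in for x.shape so the snippet runs without torch or numpy installed.

```python
import operator
from functools import reduce


def flat_dims(shape):
    """Return (N, D) as built-in ints for a shape like (N, d1, d2, ...).

    operator.index() accepts any integer-like object (including numpy.int64)
    and always returns a built-in int, so no numpy scalar reaches torch.
    """
    n = operator.index(shape[0])
    d = reduce(operator.mul, (operator.index(s) for s in shape[1:]), 1)
    return n, d


# A plain tuple stands in for x.shape here (hypothetical MNIST-like batch).
N, D = flat_dims((64, 1, 28, 28))
print(N, D, type(D).__name__)  # 64 784 int
```

With N and D computed this way, x.reshape(N, D) and torch.randn(N, D) receive built-in ints. The one-line equivalent of what the post describes is D = int(numpy.prod(x.shape[1:])).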

System Info

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 2015062 (Red Hat 4.8.5-4)
CMake version: version 2.8.12.2

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla P100-PCIE-16GB

Nvidia driver version: 384.81
cuDNN version: Probably one of the following:
/usr/local/cudnn-v5-for-cuda-7.5/lib64/libcudnn.so.5.1.3
/usr/local/cudnn-v5-for-cuda-7.5/lib64/libcudnn_static.a
/usr/local/cudnn-v5/lib64/libcudnn.so.5.1.5
/usr/local/cudnn-v5/lib64/libcudnn_static.a
/usr/local/cudnn-v6.0-for-cuda-7.5/lib64/libcudnn.so.6.0.21
/usr/local/cudnn-v6.0-for-cuda-7.5/lib64/libcudnn_static.a
/usr/local/cudnn-v6.0-for-cuda-8.0/lib64/libcudnn.so.6.0.21
/usr/local/cudnn-v6.0-for-cuda-8.0/lib64/libcudnn_static.a
/usr/local/cudnn-v7.0-for-cuda-8.0/lib64/libcudnn.so.7.0.2
/usr/local/cudnn-v7.0-for-cuda-8.0/lib64/libcudnn_static.a
/usr/local/cudnn-v7.0-for-cuda-9.0/lib64/libcudnn.so.7.0.3
/usr/local/cudnn-v7.0-for-cuda-9.0/lib64/libcudnn_static.a
/usr/local/cudnn-v7.1-for-cuda-9.0/lib/libcudnn.so
/usr/local/cudnn-v7.1-for-cuda-9.0/lib/libcudnn.so.7
/usr/local/cudnn-v7.1-for-cuda-9.0/lib/libcudnn.so.7.1.3
/usr/local/cudnn-v7.1-for-cuda-9.0/lib/libcudnn_static.a
/usr/local/cudnn-v7.1-for-cuda-9.0/lib64/libcudnn.so
/usr/local/cudnn-v7.1-for-cuda-9.0/lib64/libcudnn.so.7
/usr/local/cudnn-v7.1-for-cuda-9.0/lib64/libcudnn.so.7.1.3
/usr/local/cudnn-v7.1-for-cuda-9.0/lib64/libcudnn_static.a
/usr/local/cudnn-v7.1-for-cuda-9.2/lib/libcudnn.so
/usr/local/cudnn-v7.1-for-cuda-9.2/lib/libcudnn.so.7
/usr/local/cudnn-v7.1-for-cuda-9.2/lib/libcudnn.so.7.1.4
/usr/local/cudnn-v7.1-for-cuda-9.2/lib/libcudnn_static.a
/usr/local/cudnn-v7.1-for-cuda-9.2/lib64/libcudnn.so
/usr/local/cudnn-v7.1-for-cuda-9.2/lib64/libcudnn.so.7
/usr/local/cudnn-v7.1-for-cuda-9.2/lib64/libcudnn.so.7.1.4
/usr/local/cudnn-v7.1-for-cuda-9.2/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

@vishwakftw
Contributor

This is not reproducible on master, I believe. Could you please check?

@soumith
Member

soumith commented Sep 2, 2018

Indeed, this is fixed on master. The fix will be part of the next release.

@soumith soumith closed this as completed Sep 2, 2018
@sshaoshuai

sshaoshuai commented Oct 24, 2018

I've also run into a similar problem, but I don't use int64.
Would torch.from_numpy(np.float64) cause this problem?

Can it be fixed by installing PyTorch from master?
