Console output:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:ResNe(X)t_trainer:Running on GPUs: [0]
INFO:ResNe(X)t_trainer:Using epoch size: 6400
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.01833963394165039 secs
INFO:ResNe(X)t_trainer:Starting epoch 0/1
[E net_async_base.cc:377] [enforce fail at operator.h:1116] . Unsupported type of tensor: c10::Half
Error from operator:
input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" input: "gpu_0/conv1_spatbn_relu_b" input: "gpu_0/conv1_spatbn_relu_rm" input: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu" output: "gpu_0/conv1_spatbn_relu_rm" output: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu_sm" output: "gpu_0/conv1_spatbn_relu_siv" name: "" type: "SpatialBN" arg { name: "order" s: "NCHW" } arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "epsilon" f: 1e-05 } arg { name: "momentum" f: 0.9 } arg { name: "is_test" i: 0 } device_option { device_type: 1 device_id: 0 }
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa5eabf72b1 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x7fa5eabf70c9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: + 0x2b7a325 (0x7fa5edbad325 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #3: + 0x16032b5 (0x7fa5ec6362b5 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #4: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7fa625ccac74 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: + 0x192e7b9 (0x7fa625cd17b9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: c10::ThreadPool::main_loop(unsigned long) + 0x263 (0x7fa5eabf0d53 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #7: + 0xb8678 (0x7fa62f22f678 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: + 0x76ba (0x7fa63df3f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x6d (0x7fa63dc7541d in /lib/x86_64-linux-gnu/libc.so.6)
, op SpatialBN
[E net_async_base.cc:129] Rethrowing exception from the run of 'resnext50'
WARNING:caffe2.python.workspace:Original python traceback for operator 1 in network resnext50 in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 690, in <module>
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 685, in main
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 488, in Train
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/data_parallel_model.py", line 232, in Parallelize
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 392, in create_resnext_model_ops
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/models/resnet.py", line 344, in create_resnext
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/brew.py", line 107, in scope_wrapper
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/helpers/normalization.py", line 151, in spatial_bn
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 690, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 685, in main
Train(args)
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 584, in Train
explog
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 197, in RunEpoch
workspace.RunNet(train_model.net.Proto().name)
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/workspace.py", line 237, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/workspace.py", line 198, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.h:1116] . Unsupported type of tensor: c10::Half
Error from operator:
input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" input: "gpu_0/conv1_spatbn_relu_b" input: "gpu_0/conv1_spatbn_relu_rm" input: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu" output: "gpu_0/conv1_spatbn_relu_rm" output: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu_sm" output: "gpu_0/conv1_spatbn_relu_siv" name: "" type: "SpatialBN" arg { name: "order" s: "NCHW" } arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "epsilon" f: 1e-05 } arg { name: "momentum" f: 0.9 } arg { name: "is_test" i: 0 } device_option { device_type: 1 device_id: 0 }
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa5eabf72b1 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x7fa5eabf70c9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: + 0x2b7a325 (0x7fa5edbad325 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #3: + 0x16032b5 (0x7fa5ec6362b5 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #4: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7fa625ccac74 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: + 0x192e7b9 (0x7fa625cd17b9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: c10::ThreadPool::main_loop(unsigned long) + 0x263 (0x7fa5eabf0d53 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #7: + 0xb8678 (0x7fa62f22f678 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: + 0x76ba (0x7fa63df3f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x6d (0x7fa63dc7541d in /lib/x86_64-linux-gnu/libc.so.6)
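For triage, the key facts in the output above are the rejected tensor type and the operator that raised the enforce failure. The following is a minimal sketch of pulling them out of such a log with regular expressions. The helper name `summarize_enforce_failure` and the shortened `sample` string are illustrative assumptions, not part of Caffe2.

```python
import re


def summarize_enforce_failure(log_text):
    """Extract the rejected dtype, operator type and blob names from a log.

    Hypothetical helper for reading logs like the one in this report.
    """
    dtype = re.search(r"Unsupported type of tensor: (\S+)", log_text)
    # \b keeps "device_type" from matching the operator's "type" field.
    op_type = re.search(r'\btype: "([^"]+)"', log_text)
    return {
        "dtype": dtype.group(1) if dtype else None,
        "op_type": op_type.group(1) if op_type else None,
        "inputs": re.findall(r'\binput: "([^"]+)"', log_text),
        "outputs": re.findall(r'\boutput: "([^"]+)"', log_text),
    }


# Shortened excerpt of the error text, kept small for illustration.
sample = (
    "RuntimeError: [enforce fail at operator.h:1116] . "
    "Unsupported type of tensor: c10::Half\n"
    "Error from operator:\n"
    'input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" '
    'output: "gpu_0/conv1_spatbn_relu" name: "" type: "SpatialBN" '
    'arg { name: "order" s: "NCHW" } '
    "device_option { device_type: 1 device_id: 0 }\n"
)

summary = summarize_enforce_failure(sample)
print(summary["op_type"], summary["dtype"], summary["inputs"][0])
```

On the excerpt this reports the `SpatialBN` operator, the `c10::Half` type, and `gpu_0/conv1` as the first input blob, which matches the failure described in this issue.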
Expected behavior
The script should run and output images/sec data for each iteration.
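To make the expected images/sec figure concrete, here is a rough sketch of how a per-iteration throughput number can be derived: batch size divided by the wall-clock time of one iteration. The function `images_per_second`, the batch size of 32, and the `time.sleep` stand-in for a training step are all assumptions for illustration; this is not the trainer's actual code.

```python
import time


def images_per_second(batch_size, num_iters, step_fn):
    """Time each call to step_fn and return one images/sec value per iteration."""
    rates = []
    for _ in range(num_iters):
        start = time.perf_counter()
        step_fn()  # stands in for running one training iteration
        elapsed = time.perf_counter() - start
        rates.append(batch_size / elapsed)
    return rates


# Hypothetical stand-in for a training step: sleep for 10 ms.
rates = images_per_second(batch_size=32, num_iters=3,
                          step_fn=lambda: time.sleep(0.01))
for i, rate in enumerate(rates, 1):
    print("iteration {}: {:.1f} images/sec".format(i, rate))
```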
Environment
PyTorch Version : 1.0.0.dev20190311
OS: Ubuntu 16.04.5 LTS
How you installed PyTorch: docker from pytorch dockerhub repo - pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
Python version: 3.6
CUDA/cuDNN version: 10/7
GPU models and configuration: Tesla V100-SXM2
🐛 Bug
The SpatialBN operator fails with "Unsupported type of tensor: c10::Half" when running resnet50_trainer with the float16 data type on a V100 (CUDA 10, cuDNN 7).
To Reproduce
Steps to reproduce the behavior: