Unsupported type of tensor c10::half when running resnet50_trainer.py in Caffe2 #18000

Open
rohithkrn opened this issue Mar 13, 2019 · 3 comments


rohithkrn commented Mar 13, 2019

🐛 Bug

An `Unsupported type of tensor: c10::Half` error is raised from the SpatialBN operator when running resnet50_trainer.py with the float16 data type on a V100 (CUDA 10, cuDNN 7).
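
For reference, the failure can be isolated outside the trainer with a few lines. This is a minimal sketch only: the blob names, shapes, and values are hypothetical, and the op arguments mirror the ones in the failing op from the log below.

```python
# Minimal repro sketch (hypothetical blob names/shapes; op args mirror the
# failing SpatialBN op in the log below). fp16 activations with fp32
# scale/bias/running stats, as the trainer produces with --dtype float16.
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    workspace.FeedBlob("X", np.random.rand(2, 3, 8, 8).astype(np.float16))
    for name in ["s", "b", "rm", "riv"]:
        workspace.FeedBlob(name, np.random.rand(3).astype(np.float32))
    op = core.CreateOperator(
        "SpatialBN",
        ["X", "s", "b", "rm", "riv"],
        ["Y", "rm", "riv", "sm", "siv"],
        order="NCHW", epsilon=1e-5, momentum=0.9, is_test=0,
    )
    workspace.RunOperatorOnce(op)  # raises: Unsupported type of tensor: c10::Half
```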

To Reproduce

Steps to reproduce the behavior:

  1. docker pull pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
  2. nvidia-docker run --entrypoint="/bin/bash" -it --ipc=host --privileged --shm-size 8G --device=/dev/kfd --device=/dev/dri --runtime=nvidia pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
  3. pip install future networkx protobuf
  4. python /opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py --train_data null --batch_size 64 --epoch_size 6400 --num_epochs 1 --num_gpus 1 --float16_compute --dtype float16

Console output:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:ResNe(X)t_trainer:Running on GPUs: [0]
INFO:ResNe(X)t_trainer:Using epoch size: 6400
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.01833963394165039 secs
INFO:ResNe(X)t_trainer:Starting epoch 0/1
[E net_async_base.cc:377] [enforce fail at operator.h:1116] . Unsupported type of tensor: c10::Half
Error from operator:
input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" input: "gpu_0/conv1_spatbn_relu_b" input: "gpu_0/conv1_spatbn_relu_rm" input: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu" output: "gpu_0/conv1_spatbn_relu_rm" output: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu_sm" output: "gpu_0/conv1_spatbn_relu_siv" name: "" type: "SpatialBN" arg { name: "order" s: "NCHW" } arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "epsilon" f: 1e-05 } arg { name: "momentum" f: 0.9 } arg { name: "is_test" i: 0 } device_option { device_type: 1 device_id: 0 }frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa5eabf72b1 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x7fa5eabf70c9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: + 0x2b7a325 (0x7fa5edbad325 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #3: + 0x16032b5 (0x7fa5ec6362b5 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #4: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7fa625ccac74 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: + 0x192e7b9 (0x7fa625cd17b9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: c10::ThreadPool::main_loop(unsigned long) + 0x263 (0x7fa5eabf0d53 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #7: + 0xb8678 (0x7fa62f22f678 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: + 0x76ba (0x7fa63df3f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x6d (0x7fa63dc7541d in /lib/x86_64-linux-gnu/libc.so.6)
, op SpatialBN
[E net_async_base.cc:129] Rethrowing exception from the run of 'resnext50'
WARNING:caffe2.python.workspace:Original python traceback for operator 1 in network resnext50 in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 690, in <module>
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 685, in main
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 488, in Train
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/data_parallel_model.py", line 232, in Parallelize
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 392, in create_resnext_model_ops
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/models/resnet.py", line 344, in create_resnext
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/brew.py", line 107, in scope_wrapper
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/helpers/normalization.py", line 151, in spatial_bn
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 690, in
main()
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 685, in main
Train(args)
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 584, in Train
explog
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 197, in RunEpoch
workspace.RunNet(train_model.net.Proto().name)
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/workspace.py", line 237, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/workspace.py", line 198, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.h:1116] . Unsupported type of tensor: c10::Half
Error from operator:
input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" input: "gpu_0/conv1_spatbn_relu_b" input: "gpu_0/conv1_spatbn_relu_rm" input: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu" output: "gpu_0/conv1_spatbn_relu_rm" output: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu_sm" output: "gpu_0/conv1_spatbn_relu_siv" name: "" type: "SpatialBN" arg { name: "order" s: "NCHW" } arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "epsilon" f: 1e-05 } arg { name: "momentum" f: 0.9 } arg { name: "is_test" i: 0 } device_option { device_type: 1 device_id: 0 }frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa5eabf72b1 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x7fa5eabf70c9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: + 0x2b7a325 (0x7fa5edbad325 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #3: + 0x16032b5 (0x7fa5ec6362b5 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #4: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7fa625ccac74 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: + 0x192e7b9 (0x7fa625cd17b9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: c10::ThreadPool::main_loop(unsigned long) + 0x263 (0x7fa5eabf0d53 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #7: + 0xb8678 (0x7fa62f22f678 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: + 0x76ba (0x7fa63df3f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x6d (0x7fa63dc7541d in /lib/x86_64-linux-gnu/libc.so.6)

Expected behavior

The script should run and output images/sec data for each iteration.

Environment

  • PyTorch Version: 1.0.0.dev20190311
  • OS: Ubuntu 16.04.5 LTS
  • How you installed PyTorch: docker from pytorch dockerhub repo - pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
  • Python version: 3.6
  • CUDA/cuDNN version: 10/7
  • GPU models and configuration: Tesla V100-SXM2

ezyang commented Mar 14, 2019

This sounds like one of the Caffe2 operators doesn't support half, which I don't find terribly surprising.
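If it helps triage: reusing the hypothetical blobs and op from the sketch in the issue body, the same operator appears to run once the input is float32, which points at per-dtype dispatch rather than a broken op.

```python
# Control run (sketch, reusing the hypothetical setup from the issue body):
# only the input dtype changes from float16 to float32.
workspace.FeedBlob("X", np.random.rand(2, 3, 8, 8).astype(np.float32))
workspace.RunOperatorOnce(op)  # expected to pass; only c10::Half is rejected
```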

@rohithkrn (Author) commented:

It errors out in the SpatialBN operator, which does support half. The same script runs fine on ROCm.


bestlin commented May 3, 2019

I see that SpatialBN does support half, i.e. the input can be half, and in PyTorch the same error does not occur. How can we solve this?
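
One avenue worth trying (a hedged sketch, not a verified fix): the failing op in the log carries `use_cudnn: 1` only as an argument and has no `engine: "CUDNN"` field, so the default CUDA kernel, which appears to be float-only, may have been selected. Requesting the cuDNN engine explicitly, whose SpatialBN path accepts fp16 activations with fp32 scale/bias/stats, might route around the enforce. Blob names are the same hypothetical ones as in the earlier sketches.

```python
# Workaround sketch (untested): request the cuDNN engine explicitly instead
# of relying on the default CUDA implementation.
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    workspace.FeedBlob("X", np.random.rand(2, 3, 8, 8).astype(np.float16))
    for name in ["s", "b", "rm", "riv"]:
        workspace.FeedBlob(name, np.random.rand(3).astype(np.float32))
    op = core.CreateOperator(
        "SpatialBN",
        ["X", "s", "b", "rm", "riv"],
        ["Y", "rm", "riv", "sm", "siv"],
        order="NCHW", epsilon=1e-5, momentum=0.9, is_test=0,
        engine="CUDNN",  # ask for the cuDNN implementation explicitly
    )
    workspace.RunOperatorOnce(op)
```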
