Unsupported type of tensor c10::half when running resnet50_trainer.py in Caffe2 #18000

Open
rohithkrn opened this issue Mar 13, 2019 · 3 comments


rohithkrn commented Mar 13, 2019

🐛 Bug

An `Unsupported type of tensor: c10::Half` error is raised from the SpatialBN operator when running resnet50_trainer.py with the float16 data type on a V100 (CUDA 10, cuDNN 7).
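
For reference, the failure can be isolated outside the trainer with a few lines. This is a minimal sketch only: the blob names, shapes, and values are hypothetical, and the op arguments mirror the ones in the failing op from the log below.

```python
# Minimal repro sketch (hypothetical blob names/shapes; op args mirror the
# failing SpatialBN op in the log below). fp16 activations with fp32
# scale/bias/running stats, as the trainer produces with --dtype float16.
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    workspace.FeedBlob("X", np.random.rand(2, 3, 8, 8).astype(np.float16))
    for name in ["s", "b", "rm", "riv"]:
        workspace.FeedBlob(name, np.random.rand(3).astype(np.float32))
    op = core.CreateOperator(
        "SpatialBN",
        ["X", "s", "b", "rm", "riv"],
        ["Y", "rm", "riv", "sm", "siv"],
        order="NCHW", epsilon=1e-5, momentum=0.9, is_test=0,
    )
    workspace.RunOperatorOnce(op)  # raises: Unsupported type of tensor: c10::Half
```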

To Reproduce

Steps to reproduce the behavior:

  1. docker pull pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
  2. nvidia-docker run --entrypoint="/bin/bash" -it --ipc=host --privileged --shm-size 8G --device=/dev/kfd --device=/dev/dri --runtime=nvidia pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
  3. pip install future networkx protobuf
  4. python /opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py --train_data null --batch_size 64 --epoch_size 6400 --num_epochs 1 --num_gpus 1 --float16_compute --dtype float16

Console output:
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:ResNe(X)t_trainer:Running on GPUs: [0]
INFO:ResNe(X)t_trainer:Using epoch size: 6400
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.01833963394165039 secs
INFO:ResNe(X)t_trainer:Starting epoch 0/1
[E net_async_base.cc:377] [enforce fail at operator.h:1116] . Unsupported type of tensor: c10::Half
Error from operator:
input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" input: "gpu_0/conv1_spatbn_relu_b" input: "gpu_0/conv1_spatbn_relu_rm" input: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu" output: "gpu_0/conv1_spatbn_relu_rm" output: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu_sm" output: "gpu_0/conv1_spatbn_relu_siv" name: "" type: "SpatialBN" arg { name: "order" s: "NCHW" } arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "epsilon" f: 1e-05 } arg { name: "momentum" f: 0.9 } arg { name: "is_test" i: 0 } device_option { device_type: 1 device_id: 0 }frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa5eabf72b1 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x7fa5eabf70c9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: + 0x2b7a325 (0x7fa5edbad325 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #3: + 0x16032b5 (0x7fa5ec6362b5 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #4: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7fa625ccac74 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: + 0x192e7b9 (0x7fa625cd17b9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: c10::ThreadPool::main_loop(unsigned long) + 0x263 (0x7fa5eabf0d53 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #7: + 0xb8678 (0x7fa62f22f678 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: + 0x76ba (0x7fa63df3f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x6d (0x7fa63dc7541d in /lib/x86_64-linux-gnu/libc.so.6)
, op SpatialBN
[E net_async_base.cc:129] Rethrowing exception from the run of 'resnext50'
WARNING:caffe2.python.workspace:Original python traceback for operator 1 in network resnext50 in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 690, in <module>
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 685, in main
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 488, in Train
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/data_parallel_model.py", line 232, in Parallelize
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 392, in create_resnext_model_ops
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/models/resnet.py", line 344, in create_resnext
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/brew.py", line 107, in scope_wrapper
WARNING:caffe2.python.workspace: File "/opt/conda/lib/python3.6/site-packages/caffe2/python/helpers/normalization.py", line 151, in spatial_bn
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 690, in
main()
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 685, in main
Train(args)
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 584, in Train
explog
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/examples/resnet50_trainer.py", line 197, in RunEpoch
workspace.RunNet(train_model.net.Proto().name)
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/workspace.py", line 237, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/opt/conda/lib/python3.6/site-packages/caffe2/python/workspace.py", line 198, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.h:1116] . Unsupported type of tensor: c10::Half
Error from operator:
input: "gpu_0/conv1" input: "gpu_0/conv1_spatbn_relu_s" input: "gpu_0/conv1_spatbn_relu_b" input: "gpu_0/conv1_spatbn_relu_rm" input: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu" output: "gpu_0/conv1_spatbn_relu_rm" output: "gpu_0/conv1_spatbn_relu_riv" output: "gpu_0/conv1_spatbn_relu_sm" output: "gpu_0/conv1_spatbn_relu_siv" name: "" type: "SpatialBN" arg { name: "order" s: "NCHW" } arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "epsilon" f: 1e-05 } arg { name: "momentum" f: 0.9 } arg { name: "is_test" i: 0 } device_option { device_type: 1 device_id: 0 }frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa5eabf72b1 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x7fa5eabf70c9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: + 0x2b7a325 (0x7fa5edbad325 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #3: + 0x16032b5 (0x7fa5ec6362b5 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #4: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7fa625ccac74 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: + 0x192e7b9 (0x7fa625cd17b9 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: c10::ThreadPool::main_loop(unsigned long) + 0x263 (0x7fa5eabf0d53 in /opt/conda/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #7: + 0xb8678 (0x7fa62f22f678 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: + 0x76ba (0x7fa63df3f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x6d (0x7fa63dc7541d in /lib/x86_64-linux-gnu/libc.so.6)

Expected behavior

The script should run and output images/sec data for each iteration.

Environment

  • PyTorch Version: 1.0.0.dev20190311
  • OS: Ubuntu 16.04.5 LTS
  • How you installed PyTorch: docker from pytorch dockerhub repo - pytorch/pytorch:nightly-runtime-cuda10.0-cudnn7
  • Python version: 3.6
  • CUDA/cuDNN version: 10/7
  • GPU models and configuration: Tesla V100-SXM2

ezyang commented Mar 14, 2019

This sounds like one of the Caffe2 operators doesn't support half, which I don't find terribly surprising.
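If it helps triage: reusing the hypothetical blobs and op from the sketch in the issue body, the same operator appears to run once the input is float32, which points at per-dtype dispatch rather than a broken op.

```python
# Control run (sketch, reusing the hypothetical setup from the issue body):
# only the input dtype changes from float16 to float32.
workspace.FeedBlob("X", np.random.rand(2, 3, 8, 8).astype(np.float32))
workspace.RunOperatorOnce(op)  # expected to pass; only c10::Half is rejected
```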

@rohithkrn (Author) commented:

It errors out in the SpatialBN operator, which does support half. The same script runs fine on ROCm.


bestlin commented May 3, 2019

I see that SpatialBN does support half, i.e. the input can be half, and in PyTorch the same error does not occur. How can we solve this?
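
One avenue worth trying (a hedged sketch, not a verified fix): the failing op in the log carries `use_cudnn: 1` only as an argument and has no `engine: "CUDNN"` field, so the default CUDA kernel, which appears to be float-only, may have been selected. Requesting the cuDNN engine explicitly, whose SpatialBN path accepts fp16 activations with fp32 scale/bias/stats, might route around the enforce. Blob names are the same hypothetical ones as in the earlier sketches.

```python
# Workaround sketch (untested): request the cuDNN engine explicitly instead
# of relying on the default CUDA implementation.
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    workspace.FeedBlob("X", np.random.rand(2, 3, 8, 8).astype(np.float16))
    for name in ["s", "b", "rm", "riv"]:
        workspace.FeedBlob(name, np.random.rand(3).astype(np.float32))
    op = core.CreateOperator(
        "SpatialBN",
        ["X", "s", "b", "rm", "riv"],
        ["Y", "rm", "riv", "sm", "siv"],
        order="NCHW", epsilon=1e-5, momentum=0.9, is_test=0,
        engine="CUDNN",  # ask for the cuDNN implementation explicitly
    )
    workspace.RunOperatorOnce(op)
```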
