
ASSERT FAILED at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/ATen/core/blob.h:79 #13304

gf19880710 opened this issue Oct 30, 2018 · 8 comments

@gf19880710

🐛 Bug

Hello great programmers:
While training with FAIR's Detectron platform using the e2e_mask_rcnn_R-101-FPN_3x_gn.yaml config file, I hit the following error, which told me to report a bug to PyTorch.

[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 6.1526e-05 secs
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 5.3939e-05 secs
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 7.264e-06 secs
[I net_async_base.h:198] Using specified CPU pool size: 4; NUMA node id: -1
[I net_async_base.h:203] Created new CPU pool, size: 4; NUMA node id: -1
[E net_async_base.cc:422] IsType<T>() ASSERT FAILED at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/ATen/core/blob.h:79, please report a bug to PyTorch. wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor.
Offending Blob name: gpu_0/conv1_gn_s.
Error from operator: 
input: "gpu_0/conv1" input: "gpu_0/conv1_gn_s" input: "gpu_0/conv1_gn_b" output: "gpu_0/conv1_gn" output: "gpu_0/conv1_gn_mean" output: "gpu_0/conv1_gn_std" name: "" type: "GroupNorm" arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "group" i: 32 } arg { name: "epsilon" f: 1e-05 } device_option { device_type: 1 device_id: 0 } (Get at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/ATen/core/blob.h:79)
frame #0: <unknown function> + 0x277d775 (0x7fa4c1984775 in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #1: <unknown function> + 0x1321685 (0x7fa4c0528685 in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #2: caffe2::AsyncNetBase::run(int, int) + 0x16e (0x7fa4ec15f4ee in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #3: <unknown function> + 0x1259972 (0x7fa4ec16e972 in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #4: <unknown function> + 0x124d1cb (0x7fa4ec1621cb in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: <unknown function> + 0xafc5c (0x7fa4f6227c5c in /home/gengfeng/anaconda3/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x76db (0x7fa4fd2806db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x3f (0x7fa4fcfa988f in /lib/x86_64-linux-gnu/libc.so.6)
,  op GroupNorm
WARNING workspace.py: 187: Original python traceback for operator `1` in network `generalized_rcnn` in exception above (most recent call last):
WARNING workspace.py: 192:   File "tools/train_net.py", line 132, in <module>
WARNING workspace.py: 192:   File "tools/train_net.py", line 117, in main
WARNING workspace.py: 192:   File "tools/train_net.py", line 127, in test_model
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 128, in run_inference
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 108, in result_getter
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 159, in test_net_on_dataset
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 235, in test_net
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 328, in initialize_model_from_cfg
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/model_builder.py", line 124, in create
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/model_builder.py", line 89, in generalized_rcnn
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/model_builder.py", line 229, in build_generic_detection_model
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/optimizer.py", line 54, in build_data_parallel_model
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/model_builder.py", line 169, in _single_gpu_build_func
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/FPN.py", line 63, in add_fpn_ResNet101_conv5_body
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/FPN.py", line 104, in add_fpn_onto_conv_body
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/ResNet.py", line 48, in add_ResNet101_conv5_body
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/ResNet.py", line 99, in add_ResNet_convX_body
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/ResNet.py", line 264, in basic_gn_stem
WARNING workspace.py: 192:   File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/modeling/detector.py", line 450, in ConvGN
WARNING workspace.py: 192:   File "/home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/cnn.py", line 165, in SpatialGN
WARNING workspace.py: 192:   File "/home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/brew.py", line 107, in scope_wrapper
WARNING workspace.py: 192:   File "/home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/helpers/normalization.py", line 206, in spatial_gn
Traceback (most recent call last):
  File "tools/train_net.py", line 132, in <module>
    main()
  File "tools/train_net.py", line 117, in main
    test_model(checkpoints['final'], args.multi_gpu_testing, args.opts)
  File "tools/train_net.py", line 127, in test_model
    check_expected_results=True,
  File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 159, in test_net_on_dataset
    weights_file, dataset_name, proposal_file, output_dir, gpu_id=gpu_id
  File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test_engine.py", line 258, in test_net
    model, im, box_proposals, timers
  File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test.py", line 66, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, boxes=box_proposals
  File "/home/gengfeng/Desktop/projects/DETECTRON/detectron/core/test.py", line 158, in im_detect_bbox
    workspace.RunNet(model.net.Proto().name)
  File "/home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/workspace.py", line 219, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: IsType<T>() ASSERT FAILED at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/ATen/core/blob.h:79, please report a bug to PyTorch. wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor.
Offending Blob name: gpu_0/conv1_gn_s.
Error from operator: 
input: "gpu_0/conv1" input: "gpu_0/conv1_gn_s" input: "gpu_0/conv1_gn_b" output: "gpu_0/conv1_gn" output: "gpu_0/conv1_gn_mean" output: "gpu_0/conv1_gn_std" name: "" type: "GroupNorm" arg { name: "use_cudnn" i: 1 } arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "group" i: 32 } arg { name: "epsilon" f: 1e-05 } device_option { device_type: 1 device_id: 0 } (Get at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/ATen/core/blob.h:79)
frame #0: <unknown function> + 0x277d775 (0x7fa4c1984775 in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #1: <unknown function> + 0x1321685 (0x7fa4c0528685 in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #2: caffe2::AsyncNetBase::run(int, int) + 0x16e (0x7fa4ec15f4ee in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #3: <unknown function> + 0x1259972 (0x7fa4ec16e972 in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #4: <unknown function> + 0x124d1cb (0x7fa4ec1621cb in /home/gengfeng/anaconda3/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #5: <unknown function> + 0xafc5c (0x7fa4f6227c5c in /home/gengfeng/anaconda3/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x76db (0x7fa4fd2806db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x3f (0x7fa4fcfa988f in /lib/x86_64-linux-gnu/libc.so.6)
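
For what it's worth, the assertion fires because the GroupNorm op reads gpu_0/conv1_gn_s before any tensor has been written into that blob, i.e. the GN scale weights were never loaded into the workspace. Below is a minimal sketch of my own (using the standard caffe2.python workspace API; the blob names are just the ones from the trace above) to check which of the operator's inputs are actually populated before calling RunNet:

from caffe2.python import workspace

# Blob names copied from the failing GroupNorm operator above.
needed = ["gpu_0/conv1", "gpu_0/conv1_gn_s", "gpu_0/conv1_gn_b"]

existing = set(workspace.Blobs())  # every blob currently registered in the workspace
for name in needed:
    if name not in existing:
        print("missing blob:", name)
        continue
    try:
        arr = workspace.FetchBlob(name)  # raises if the blob holds no tensor yet
        print(name, getattr(arr, "shape", type(arr)))
    except Exception as exc:
        print(name, "registered but uninitialized:", exc)

If gpu_0/conv1_gn_s shows up as missing or uninitialized, the weight-loading step is what needs fixing, not the operator itself.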

To Reproduce

Steps to reproduce the behavior:

Expected behavior

No exceptions during training and testing.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

gengfeng@ai-work-4:~/Downloads$ python collect_env.py
Collecting environment information...
PyTorch version: 1.0.0.dev20181015
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010
CMake version: version 3.11.4

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce GTX 1080
Nvidia driver version: 390.87
cuDNN version: Probably one of the following:
/usr/local/cuda-9.0/lib64/libcudnn.so
/usr/local/cuda-9.0/lib64/libcudnn.so.7
/usr/local/cuda-9.0/lib64/libcudnn.so.7.2.1
/usr/local/cuda-9.0/lib64/libcudnn_static.a
/usr/local/cuda-9.1/lib64/libcudnn.so
/usr/local/cuda-9.1/lib64/libcudnn.so.7.1.3
/usr/local/cuda-9.1/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] cuda91 1.0 h4c16780_0 pytorch
[conda] pytorch-nightly 1.0.0.dev20181015 py3.6_cuda9.0.176_cudnn7.1.2_0 pytorch
[conda] torch 0.4.0
[conda] torchvision 0.2.1


My own config *e2e_mask_rcnn_R-101-FPN_3x_gn.yaml*:
MODEL:
  TYPE: generalized_rcnn
  CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
  NUM_CLASSES: 2
  FASTER_RCNN: True
  MASK_ON: True
NUM_GPUS: 1
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.002
  GAMMA: 0.1
  MAX_ITER: 70000
  STEPS: [0, 40000, 60000]
FPN:
  FPN_ON: True
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
  USE_GN: True  # Note: use GN on the FPN-specific layers
RESNETS:
  STRIDE_1X1: False  # default True for MSRA; False for C2 or Torch models
  TRANS_FUNC: bottleneck_gn_transformation  # Note: this is a GN bottleneck transform
  STEM_FUNC: basic_gn_stem  # Note: this is a GN stem
  SHORTCUT_FUNC: basic_gn_shortcut  # Note: this is a GN shortcut
FAST_RCNN:
  ROI_BOX_HEAD: fast_rcnn_heads.add_roi_Xconv1fc_gn_head  # Note: this is a Conv GN head
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 7
  ROI_XFORM_SAMPLING_RATIO: 2
MRCNN:
  ROI_MASK_HEAD: mask_rcnn_heads.mask_rcnn_fcn_head_v1up4convs_gn  # Note: this is a GN mask head
  RESOLUTION: 28  # (output mask resolution) default 14
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 14  # default 7
  ROI_XFORM_SAMPLING_RATIO: 2  # default 0
  DILATION: 1  # default 2
  CONV_INIT: MSRAFill  # default GaussianFill
TRAIN:
  WEIGHTS: https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/47592356/R-101-GN.pkl  # Note: a GN pre-trained model
  DATASETS: ('coco_labelme_train',)
  SCALES: (700,)
  MAX_SIZE: 1333
  BATCH_SIZE_PER_IM: 64
  RPN_PRE_NMS_TOP_N: 2000  # Per FPN level
TEST:
  DATASETS: ('coco_labelme_val',)
  SCALE: 700
  MAX_SIZE: 1333
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .

Additional context

By the way, the e2e_mask_rcnn_R-50-FPN_1x.yaml config works fine for me.

Awaiting your response, thank you.

@Dene33

Dene33 commented Dec 6, 2018

I have the same problem. Have you managed to fix it?

@arjun-kava

@gf19880710, the same error is generated with "e2e_mask_rcnn_R-50-FPN_1x.yaml" as well.

@arjun-kava

The problem is that an already-trained model left in OUTPUT_DIR somehow conflicts with test_model. After moving the previously trained model out of that directory, it works fine.
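
Roughly what I mean, as a quick sketch (Python 3; OUTPUT_DIR and the backup path here are placeholders, adjust them to your setup) before re-running testing:

import glob
import os
import shutil

output_dir = "."                                      # OUTPUT_DIR from the yaml config
backup_dir = os.path.expanduser("~/old_checkpoints")  # hypothetical backup location
os.makedirs(backup_dir, exist_ok=True)

# Move previously trained .pkl checkpoints out of the way so testing
# does not pick up a stale model with mismatched blobs.
for pkl in glob.glob(os.path.join(output_dir, "**", "*.pkl"), recursive=True):
    shutil.move(pkl, os.path.join(backup_dir, os.path.basename(pkl)))
    print("moved", pkl)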

@tleers

tleers commented Dec 13, 2018

The problem is that an already-trained model left in OUTPUT_DIR somehow conflicts with test_model. After moving the previously trained model out of that directory, it works fine.

Hey, I'm experiencing the same error at inference time, so there is no test model that could conflict. Any clue what could be going wrong?

@arjun-kava

@tleers, just check whether you have trained the same model with any other configuration. If so, move the model files from the output directory to another directory.

@qvks

qvks commented Jun 1, 2019

I am experiencing the same issue at inference time, have not trained the same model with any other configuration, and there is nothing in OUTPUT_DIR (I made a new one). Does anyone know how to solve this?

@ezyang ezyang added the caffe2 label Jun 3, 2019
@AkashKabra11

Hi @arjun-kava, @qvks, @tleers,
I am getting a similar kind of error with FAIR's Detectron.

ASSERT FAILED at /pytorch/aten/src/ATen/core/blob.h:77, please report a bug to PyTorch. wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor.
Offending Blob name: gpu_0/conv1_w.
Error from operator: 
input: "gpu_0/data" input: "gpu_0/conv1_w" output: "gpu_0/conv1" name: "" type: "Conv" arg { name: "kernel" i: 7 } arg { name: "order" s: "NCHW" } arg { name: "pad" i: 3 } arg { name: "stride" i: 2 } arg { name: "exhaustive_search" i: 0 } device_option { device_type: 1 device_id: 0 } engine: "CUDNN" (Get at /pytorch/aten/src/ATen/core/blob.h:77)
**** And then similar frame# XYZ traceback*****

I am using e2e_faster_rcnn_R-50-FPN_1x.yaml. I have trained FASTER-RCNN with FPN on a custom dataset with 12 classes. Also, there is no other trained model in OUTPUT_DIR.

I am using Google Colab, so I have a CUDA 10.0 with cuDNN 7.501 environment available.
Could someone look into this issue? Is there some issue with the layer name, or is it a bug in PyTorch? (The error trace says "please report a bug to PyTorch".)
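
Just a thought, and only a rough sketch with a hypothetical path: you could check whether the weights .pkl you load actually contains conv1_w at all. This assumes the usual Detectron layout, a pickled dict of blob name -> array, sometimes nested under a 'blobs' key.

import pickle

# Path is hypothetical; point it at the weights file you pass for testing/inference.
with open("model_final.pkl", "rb") as f:
    data = pickle.load(f, encoding="latin1")  # the encoding kwarg is Python 3 only

blobs = data.get("blobs", data) if isinstance(data, dict) else data
print("conv1_w present:", "conv1_w" in blobs)
print("first few blob names:", sorted(blobs)[:10])

If conv1_w is missing there, the problem is most likely the weights file or how it is loaded, rather than PyTorch itself.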

@cloudjay

Hi. Not sure if this helps, but I'm getting this same error running Detectron inference with a pretrained model from the Model Zoo and --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-50-FPN_1x.yaml

[E net_async_base.cc:382] IsType<T>() INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch-nightly_1562648889042/work/aten/src/ATen/core/blob.h:77, please report a bug to PyTorch. wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor. Offending Blob name: gpu_0/conv_rpn_fpn2_w.

I'm using Python 2.7.14 (Anaconda) on Ubuntu 18.04, and this same machine is used for other ML training projects, so CUDA and cuDNN should be working properly.
