[object_detection] model_main.py failure: tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative #4798

Open
dnuffer opened this Issue Jul 17, 2018 · 24 comments


dnuffer commented Jul 17, 2018


System information

  • What is the top-level directory of the model you are using: research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): ('v1.9.0-0-g25c197e023', '1.9.0')
  • Bazel version (if compiling from source): N/A (installed from binary)
  • CUDA/cuDNN version: 9.0/7.1
  • GPU model and memory: 1080/8GB
  • Exact command to reproduce: python /models/research/object_detection/model_main.py --pipeline_config_path=faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.config --model_dir=. --num_train_steps=1000 --num_eval_steps=1000 --alsologtostderr


Describe the problem

I am trying to use transfer learning to train a model for the Open Images Challenge. I prepared the data as TFRecord files and downloaded faster_rcnn_inception_resnet_v2_atrous_oid from the model zoo. I created a config by changing the number of classes and the paths. When I run model_main.py to start training, it fails with the following exception:

Traceback (most recent call last):                                                                                                                                                                                                                            
  File "/models/research/object_detection/model_main.py", line 101, in <module>                                                                                                                                                                               
    tf.app.run()                                                                                                                                                                                                                                              
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run                                                                                                                                                           
    _sys.exit(main(argv))                                                                                                                                                                                                                                     
  File "/models/research/object_detection/model_main.py", line 97, in main                                                                                                                                                                                    
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])                                                                                                                                                                                     
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 447, in train_and_evaluate                                                                                                                                      
    return executor.run()                                                                                                                                                                                                                                     
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 531, in run                                                                                                                                                     
    return self.run_local()                                                                                                                                                                                                                                   
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 669, in run_local                                                                                                                                               
    hooks=train_hooks)                                                                                                                                                                                                                                        
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train                                                                                                                                                  
    loss = self._train_model(input_fn, hooks, saving_listeners)                                                                                                                                                                                               
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model                                                                                                                                          
    return self._train_model_default(input_fn, hooks, saving_listeners)                                                                                                                                                                                       
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default                                                                                                                                  
    saving_listeners)                                                                                                                                                                                                                                         
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1336, in _train_with_estimator_spec                                                                                                                            
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])                                                                                                                                                                                    
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 577, in run                                                                                                                                             
    run_metadata=run_metadata)                                                                                                                                                                                                                                
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1053, in run                                                                                                                                            
    run_metadata=run_metadata)                                                                                                                                                                                                                                
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1144, in run                                                                                                                                            
    raise six.reraise(*original_exc_info)                                                                                                                                                                                                                     
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1129, in run                                                                                                                                            
    return self._sess.run(*args, **kwargs)                                                                                                                                                                                                                    
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run                                                                                                                                            
    run_metadata=run_metadata)                                                                                                                                                                                                                                
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in run                                                                                                                                             
    return self._sess.run(*args, **kwargs)                                                                                                                                                                                                                    
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run                                                                                                                                                         
    run_metadata_ptr)                                                                                                                                                                                                                                         
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run                                                                                                                                                       
    feed_dict_tensor, options, run_metadata)                                                                                                                                                                                                                  
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run                                                                                                                                                    
    run_metadata)                                                                                                                                                                                                                                             
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call                                                                                                                                                   
    raise type(e)(node_def, op, message)                                                                                                                                                                                                                      
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -54                                                                                                                                                            
         [[Node: Pad_9 = Pad[T=DT_FLOAT, Tpaddings=DT_INT32, _device="/device:CPU:0"](cond_2/Merge, stack_9)]]                                                                                                                                                
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[1], [1,?,?,3], [1,3], [1,100], [1,100,4], [1,100,500], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
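
For readers unfamiliar with this message: a pad op takes a (before, after) pair per dimension, and "0 -54" reads as one such pair whose "after" amount is negative. That would happen if some tensor were 54 elements longer along a dimension than the fixed size it is being padded to. The sketch below only illustrates that arithmetic in plain Python. The helper names and the sizes 154 and 100 are hypothetical, chosen so the difference is -54; the log does not say which tensor or dimension is involved, and this is not the Object Detection API's own code.

```python
def pad_amounts(actual_shape, target_shape):
    """Return a (before, after) padding pair per dimension.

    Pads only at the end, so "after" is target minus actual.
    It goes negative when the input is larger than the target.
    """
    return [(0, target - actual)
            for actual, target in zip(actual_shape, target_shape)]


def check_paddings(paddings):
    """Mimic the non-negativity check that a pad op enforces."""
    for before, after in paddings:
        if before < 0 or after < 0:
            raise ValueError(
                "Paddings must be non-negative: %d %d" % (before, after))
    return paddings


# Hypothetical sizes: 154 is picked only so that 100 - 154 == -54.
paddings = pad_amounts(actual_shape=(154,), target_shape=(100,))
print(paddings)

try:
    check_paddings(paddings)
except ValueError as err:
    print(err)
```

Under that reading, the thing to look for is an input example that has more entries along some dimension than the static shapes listed in the IteratorGetNext node allow.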

Source code / logs


This is the config I used:

model {
  faster_rcnn {
    num_classes: 500
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: "faster_rcnn_inception_resnet_v2"
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        height_stride: 8
        width_stride: 8
        scales: 0.25
        scales: 0.5
        scales: 1.0
        scales: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 1.0
        aspect_ratios: 2.0
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.00999999977648
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.699999988079
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        use_dropout: false
        dropout_keep_probability: 1.0
      }
    }
    second_stage_batch_size: 20
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config {
  batch_size: 1
  optimizer {
    momentum_optimizer {
      learning_rate {
        manual_step_learning_rate {
          initial_learning_rate: 5.99999964379e-07
          schedule {                                                                                                      
            step: 1000                                                                                                    
            learning_rate: 5.99999984843e-05                                                                              
          }                                                                                                               
          schedule {                                                                                                      
            step: 60000                                                                                                   
            learning_rate: 6.00000021223e-06                                                                              
          }                                                                                                               
          schedule {                                                                                                      
            step: 70000                                                                                                   
            learning_rate: 6.00000021223e-07                                                                              
          }                                                                                                               
        }                                                                                                                 
      }                                                                                                                   
      momentum_optimizer_value: 0.899999976158                                                                            
    }                                                                                                                     
    use_moving_average: false                                                                                             
  }                                                                                                                       
  gradient_clipping_by_norm: 10.0                                                                                         
  fine_tune_checkpoint: "/data/object_detection_models/faster_rcnn_inception_resnet_v2_atrous_oid_2018_01_28/model.ckpt"  
  num_steps: 1000                                                                                                         
  load_all_detection_checkpoint_vars: true                                                                                
  fine_tune_checkpoint_type: "detection"                                                                                  
}                                                                                                                         
train_input_reader {                                                                                                      
  label_map_path: "/models/research/object_detection/data/oid_object_detection_challenge_500_label_map.pbtxt"             
  num_readers: 1                                                                                                          
  tf_record_input_reader {                                                                                                
    input_path: "/data/images/train_tfrecords/tfrecord-00000-of-00001"                                                    
  }                                                                                                                       
}                                                                                                                         
eval_config {                                                                                                             
  num_examples: 1000                                                                                                      
  max_evals: 10                                                                                                           
  metrics_set: "open_images_metrics"                                                                                      
  use_moving_averages: false                                                                                              
  retain_original_images: true                                                                                            
}                                                                                                                         
eval_input_reader {                                                                                                       
  label_map_path: "/models/research/object_detection/data/oid_object_detection_challenge_500_label_map.pbtxt"             
  shuffle: false                                                                                                          
  num_readers: 1                                                                                                          
  tf_record_input_reader {                                                                                                
    input_path: "/data/images/validation_tfrecords/tfrecord-00000-of-00001"                                               
  }                                                                                                                       
}                                                                                                                         
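
One sanity check that may be worth running against a config like the one above is to confirm that num_classes agrees with the number of item entries in the label map. This report does not establish that a mismatch is the cause of the failure, so treat the following as a minimal diagnostic sketch only. The function names are hypothetical, the parsing is a simple regular-expression count rather than a real protobuf parse, and the sample strings are made-up placeholders, not real Open Images entries.

```python
import re


def count_label_map_items(label_map_text):
    """Count `item { ... }` entries in label map text (regex-based, not a protobuf parse)."""
    return len(re.findall(r"\bitem\s*\{", label_map_text))


def read_num_classes(pipeline_config_text):
    """Return the first num_classes value found in the config text, or None."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", pipeline_config_text)
    return int(match.group(1)) if match else None


# Placeholder inputs for illustration only.
sample_label_map = """
item {
  name: "class_a"
  id: 1
  display_name: "Class A"
}
item {
  name: "class_b"
  id: 2
  display_name: "Class B"
}
"""

sample_config = """
model {
  faster_rcnn {
    num_classes: 2
  }
}
"""

num_items = count_label_map_items(sample_label_map)
num_classes = read_num_classes(sample_config)
print("label map items:", num_items, "num_classes:", num_classes)
if num_items != num_classes:
    print("Mismatch between label map and num_classes")
```

To use it on real files, read the contents of the label_map_path and pipeline config files into strings and pass them to the two functions.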

This is the complete output from model_main.py:

/usr/local/lib/python2.7/dist-packages/object_detection/utils/visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'TkAgg' by the following code:
  File "/models/research/object_detection/model_main.py", line 26, in <module>
    from object_detection import model_lib
  File "/usr/local/lib/python2.7/dist-packages/object_detection/model_lib.py", line 26, in <module>
    from object_detection import eval_util
  File "/usr/local/lib/python2.7/dist-packages/object_detection/eval_util.py", line 28, in <module>
    from object_detection.metrics import coco_evaluation
  File "/usr/local/lib/python2.7/dist-packages/object_detection/metrics/coco_evaluation.py", line 20, in <module>
    from object_detection.metrics import coco_tools
  File "/usr/local/lib/python2.7/dist-packages/object_detection/metrics/coco_tools.py", line 47, in <module>
    from pycocotools import coco
  File "build/bdist.linux-x86_64/egg/pycocotools/coco.py", line 49, in <module>
    import matplotlib.pyplot as plt
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 71, in <module>
    from matplotlib.backends import pylab_setup
  File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 16, in <module>
    line for line in traceback.format_stack()


  import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7f09e1527488>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/core/box_predictor.py:407: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py:2037: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [global_step] is not available in checkpoint
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/core/losses.py:317: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/core/losses.py:317: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-07-17 12:48:43.762318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-17 12:48:44.053889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:05:00.0
totalMemory: 7.93GiB freeMemory: 7.81GiB
2018-07-17 12:48:44.053946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-17 12:48:44.250098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-17 12:48:44.250181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0
2018-07-17 12:48:44.250200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N
2018-07-17 12:48:44.250436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7541 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:05:00.0, compute capability: 6.1)
2018-07-17 12:49:28.560860: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:28.562139: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.56GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.242122: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.267979: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.300060: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.651405: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.62GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:42.910921: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:42.935985: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:42.967801: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:45.355076: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
  File "/models/research/object_detection/model_main.py", line 101, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/models/research/object_detection/model_main.py", line 97, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 447, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 531, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 669, in run_local
    hooks=train_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1336, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 577, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -54
         [[Node: Pad_9 = Pad[T=DT_FLOAT, Tpaddings=DT_INT32, _device="/device:CPU:0"](cond_2/Merge, stack_9)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[1], [1,?,?,3], [1,3], [1,100], [1,100,4], [1,100,500], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
@mawah

mawah commented Jul 17, 2018

I'm also curious about this issue; I'm receiving an analogous error message and, like dnuffer, I'm getting the message when using custom tfrecord files. In my case, I've packaged everything into a docker image, mawah/debug:tfodapiretinanet, built off of tensorflow/tensorflow:1.8.0-gpu and commit e2d4637 of this repository. The command to reproduce is:

python /retinanet/models/research/object_detection/model_main.py \
    --pipeline_config_path=/retinanet/scripts/retinanet.config \
    --model_dir=/retinanet/output_model \
    --num_train_steps=25000 \
    --num_eval_steps=8000 \
    --alsologtostderr

I've noticed that my error message is not exactly the same if I run the command twice -- the negative integer in the first line below changes each time.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -10
         [[Node: Pad_9 = Pad[T=DT_FLOAT, Tpaddings=DT_INT32](cond_2/Merge, stack_9)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[64], [64,640,640,3], [64,3], [64,100], [64,100,4], [64,100,60], [64,100], [64,100], [64,100], [64]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
         [[Node: IteratorGetNext/_3911 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_892_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Thanks very much for your help!
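
One reading of this error that fits the later comments about max_number_of_boxes: the input pipeline pads per-image groundtruth tensors to a fixed length (the 100 in the `[64,100,4]` output shape), and the requested padding goes negative when an image carries more entries than that. The sketch below is plain Python and is not the Object Detection API code. `pad_amounts`, `pad_or_clip`, and the counts 154 and 110 are hypothetical, chosen only so that the arithmetic reproduces the `0 -54` and `0 -10` pairs seen in the logs.

```python
def pad_amounts(current_len, target_len):
    """Return the (before, after) padding needed to reach target_len."""
    return (0, target_len - current_len)


def pad_or_clip(values, target_len, fill=0.0):
    """Pad a list with `fill` up to target_len, or clip it when it is longer."""
    if len(values) >= target_len:
        return values[:target_len]
    return values + [fill] * (target_len - len(values))


# Assumed counts: 154 and 110 entries against a padded length of 100.
print(pad_amounts(154, 100))  # (0, -54)
print(pad_amounts(110, 100))  # (0, -10)

# Clipping instead of padding keeps the result at the fixed length.
print(len(pad_or_clip([1.0] * 154, 100)))  # 100
```

If this reading is right, it would also explain why the negative number changes between runs: a shuffled input order means a different oversized example is hit first.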

@dnuffer

Author

dnuffer commented Jul 17, 2018

I also see the negative integer change every run.
I tried using both image resizers (keep_aspect_ratio_resizer and fixed_shape_resizer), but that didn't make a difference.
I tried using images that were all resized to 320x240 in the tfrecords, but that also didn't make a difference.
I thought this might be related to data augmentation, so I tried using various combinations of no augmentation, random_crop_pad_image, random_crop_image, and random_horizontal_flip, but every combination also yielded this same InvalidArgumentError.

@mayorquinmachines

mayorquinmachines commented Jul 17, 2018

@dnuffer I also saw the value change between runs, so I started looking into our image dataset.
I initially suspected a problem with cropping, with the normalization of bounding boxes, or with (h, w) versus (w, h) image size conventions.
I filtered out all but the square images.
Since most images were 500x500, I also filtered out the few 300x300 ones.
In the config, I replaced the 300x300 crop dimensions with 500x500 (no crop).
This allowed training to start, but the InvalidArgumentError eventually reappeared.
I'll be on the lookout for the fix!

@dnuffer

Author

dnuffer commented Jul 18, 2018

I tried using 500x500 images together with

    image_resizer {
      fixed_shape_resizer {
        height: 500
        width: 500
      }
    }

in my config, but unfortunately I still got the error.

@MingRuey

MingRuey commented Jul 19, 2018

I'm experiencing the same issue when using pre-trained Faster R-CNN models on the Open Images Challenge 2018 dataset.

Most of the Faster R-CNN models listed in the zoo lead to almost the same error (other models fail with different errors, so I could not try them).

For example, the pre-trained "faster_rcnn_inception_resnet_v2_atrous_oid" with the following config generates ...InvalidArgumentError: Paddings must be non-negative: ... [[Node: Pad_9 = Pad[...](cond_2/Merge, stack_9)]].

  • I suspect this is data dependent, since changing the order of the data can make the error happen after more batches.
  • Note that I did not use data augmentation at all.
  • I also tried fixed_shape_resizer, which didn't work, along with some other random attempts, including setting unpad_groundtruth_tensors to false.

model {
  faster_rcnn {
    num_classes: 500
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: "faster_rcnn_inception_resnet_v2"
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.00006
          schedule {
            step: 100
            learning_rate: .000006
          }
          schedule {
            step: 1000
            learning_rate: .0000006
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  unpad_groundtruth_tensors: false

  from_detection_checkpoint: True
  fine_tune_checkpoint: ".../pretrained-model.ckpt"
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/....tfrecord"
  }
  label_map_path: ".../label_map.pbtxt"
}


eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 1000
  max_evals: 5
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/....tfrecord"
  }
  label_map_path: ".../label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
@tenoyart

tenoyart commented Jul 19, 2018

I also met this error when I trained on my customized data. It turned out that some of the bounding boxes of target objects were too small inside my training image. When I filtered out those small bounding boxes and kept larger ones, I could run without such errors.

@ronykalfarisi

ronykalfarisi commented Jul 20, 2018

If you're reading this and still haven't managed to train on your custom dataset, abandon model_main.py and go back to train.py. I got this error with model_main.py, so I switched to train.py, and everything works like a charm. I even got great results. Good luck!

@5000commas

5000commas commented Jul 23, 2018

I'm having the same issue. Could it have something to do with overlapping label boxes of the same class?

@mayorquinmachines

mayorquinmachines commented Jul 23, 2018

Training has now run for 60 hours without the InvalidArgumentError after I filtered out all boxes where box width < image width / 20 (and likewise for height).
I believe @tenoyart's suggestion was at the heart of it.
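
The filtering rule above can be sketched as a small predicate. This is only an illustration under assumptions: boxes are taken to be `(xmin, ymin, xmax, ymax)` tuples in pixels, and the names `is_large_enough` and `filter_small_boxes` are hypothetical, not part of the Object Detection API.

```python
def is_large_enough(box, image_width, image_height, divisor=20):
    """Return True when the box spans at least 1/divisor of the image on both axes.

    box is (xmin, ymin, xmax, ymax) in pixels.
    """
    xmin, ymin, xmax, ymax = box
    # Multiply instead of dividing so integer inputs are compared exactly.
    return ((xmax - xmin) * divisor >= image_width and
            (ymax - ymin) * divisor >= image_height)


def filter_small_boxes(boxes, image_width, image_height, divisor=20):
    """Keep only the boxes that pass is_large_enough."""
    return [b for b in boxes
            if is_large_enough(b, image_width, image_height, divisor)]


# Hypothetical 500x500 image: the threshold is 500 / 20 = 25 pixels per side.
boxes = [(0, 0, 24, 100), (10, 10, 210, 310), (5, 5, 105, 15)]
print(filter_small_boxes(boxes, 500, 500))  # [(10, 10, 210, 310)]
```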

@xuehuachunsheng

xuehuachunsheng commented Jul 24, 2018

I want to fine-tune the 'ssd_inception_v2_coco_2018_01_28' model on Open Images Dataset V4, and the same error occurred.
Could you tell me how to filter out all the small boxes, or provide sample code?
@mayorquinmachines @tenoyart
Since the TFRecords for the Open Images data are already generated, I can't regenerate the dataset with the small boxes removed beforehand because it is too large.
How can I remove those boxes when I read a TF example?

@dnuffer

Author

dnuffer commented Jul 24, 2018

I am able to use legacy/{train,eval}.py with the exact same config and dataset without any problem, so this problem seems to be related to something that model_main.py is doing differently.

@mayorquinmachines

mayorquinmachines commented Jul 25, 2018

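We used labelImg, so we used ElementTree to load the XML files.
Something like

import xml.etree.ElementTree as ET

xml_file = '/path/to/annotation.xml'
tree = ET.parse(xml_file)
root = tree.getroot()
for obj in root.findall('object'):
    for bx in obj.findall('bndbox'):
        # Map each coordinate tag (xmin, ymin, xmax, ymax) to its integer value.
        # Iterating the element directly replaces getchildren(), which was removed in Python 3.9.
        bx_coord = {coord.tag: int(coord.text) for coord in bx}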

From there we can compute (ymax - ymin) / height and compare it against a threshold such as 0.05.
We do the same for (xmax - xmin) / width, then use ElementTree to remove any object below the threshold and overwrite the file.
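
A minimal sketch of that removal step, assuming the Pascal VOC layout that labelImg writes (a `size` element with `width` and `height`, and `object` elements directly under the root). The function name `drop_small_objects` and the inline sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_small_objects(root, min_fraction=0.05):
    """Remove <object> children of root whose box is too small; return the count."""
    size = root.find('size')
    img_w = float(size.find('width').text)
    img_h = float(size.find('height').text)
    removed = 0
    # Snapshot the matches first because root is modified inside the loop.
    for obj in list(root.findall('object')):
        coords = {c.tag: int(c.text) for c in obj.find('bndbox')}
        rel_w = (coords['xmax'] - coords['xmin']) / img_w
        rel_h = (coords['ymax'] - coords['ymin']) / img_h
        if rel_w < min_fraction or rel_h < min_fraction:
            root.remove(obj)
            removed += 1
    return removed


# Hypothetical annotation: one 10x10 box and one 200x300 box in a 500x500 image.
SAMPLE = """<annotation>
  <size><width>500</width><height>500</height><depth>3</depth></size>
  <object><name>tiny</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>20</xmax><ymax>20</ymax></bndbox>
  </object>
  <object><name>big</name>
    <bndbox><xmin>50</xmin><ymin>50</ymin><xmax>250</xmax><ymax>350</ymax></bndbox>
  </object>
</annotation>"""

root = ET.fromstring(SAMPLE)
removed_count = drop_small_objects(root)
remaining = [o.find('name').text for o in root.findall('object')]
print(removed_count, remaining)  # 1 ['big']
```

For a file on disk, the same function would be applied to `tree.getroot()` after `ET.parse(xml_file)`, followed by `tree.write(xml_file)` to overwrite the annotation.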

For the CSV files, you might load the data into a pandas DataFrame, add columns for the differences ymax - ymin and xmax - xmin, query for the rows where the differences are big enough to keep, write the result to a file, and rebuild the TFRecords.
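
The same idea for CSV annotations, sketched with the standard `csv` module rather than pandas so that it runs without extra dependencies. The column names (`filename`, `width`, `height`, `class`, `xmin`, `ymin`, `xmax`, `ymax`) are an assumption about the file layout, and the sizes are compared relative to the image as in the XML case. `keep_large_rows` is a hypothetical name.

```python
import csv
import io


def keep_large_rows(rows, min_fraction=0.05):
    """Return the rows whose box covers at least min_fraction of each image side."""
    kept = []
    for row in rows:
        rel_w = (int(row['xmax']) - int(row['xmin'])) / float(row['width'])
        rel_h = (int(row['ymax']) - int(row['ymin'])) / float(row['height'])
        if rel_w >= min_fraction and rel_h >= min_fraction:
            kept.append(row)
    return kept


# Hypothetical annotations; a real script would open the CSV file instead.
SAMPLE_CSV = """filename,width,height,class,xmin,ymin,xmax,ymax
a.jpg,500,500,cat,50,50,250,350
b.jpg,500,500,cat,10,10,20,20
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
kept = keep_large_rows(rows)

# Write the surviving rows back out with the original header.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(kept)
print(out.getvalue())
```

The TFRecords would then be rebuilt from the filtered file.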

@gfelbing

gfelbing commented Jul 25, 2018

I agree with @dnuffer.
I had the same issue; toggling augmentations and removing small boxes did not help in my case.
But I am able to use the legacy/{train,eval}.py with the very same configuration.

@huangynn

huangynn commented Jul 30, 2018

Can any TF contributor look into this? legacy/train.py is able to train, which means something in model_main.py is behaving unexpectedly.

@zhangmengya

zhangmengya commented Jul 30, 2018

@tenoyart Hello, could you explain why it runs once the small bounding boxes are removed? I'm very confused. Thank you.

@pkulzc

Contributor

pkulzc commented Jul 31, 2018

I'm looking into this now.

@pkulzc

Contributor

pkulzc commented Aug 1, 2018

Changes in my recent PR should be able to fix this issue. Please sync your fork to latest and re-run once the PR gets merged. Thanks!

@ashwaniag

ashwaniag commented Aug 2, 2018

@pkulzc I pulled the latest changes, but now I get a different error. Hope you can help!
tensorflow/tensorflow#21320

@kulkarnivishal

kulkarnivishal commented Aug 7, 2018

The issue is not fixed yet; I still get the error "Expected size[0] in [0, 100], but got 225".

@zhangmengya

zhangmengya commented Aug 7, 2018

I solved the issue by increasing the value of the max_number_of_boxes parameter, which is defined in input_reader.proto. After that, training runs well.

@adriangreen

adriangreen commented Aug 7, 2018

I was getting the error "Expected size[0] in [0, 100], but got 225" after pulling and was able to solve it by explicitly specifying a large value for max_number_of_boxes here:

https://github.com/tensorflow/models/blob/master/research/object_detection/inputs.py#L391
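
To pick a value for max_number_of_boxes, it may help to check how many boxes the busiest image in the dataset actually has. The sketch below assumes annotations are available as `(image_id, box)` pairs, which is a hypothetical representation, and uses the limit of 100 taken from the error message. Both function names are made up for illustration.

```python
from collections import Counter


def count_boxes_per_image(annotations):
    """Count boxes per image from (image_id, box) pairs."""
    return Counter(image_id for image_id, _ in annotations)


def images_over_limit(box_counts, limit=100):
    """Return the sorted image ids that have more than `limit` boxes."""
    return sorted(image_id for image_id, n in box_counts.items() if n > limit)


# Hypothetical annotations: image "a" has 225 boxes, image "b" has 3.
annotations = [("a", (0, 0, 10, 10))] * 225 + [("b", (0, 0, 10, 10))] * 3
counts = count_boxes_per_image(annotations)

print(max(counts.values()))       # 225
print(images_over_limit(counts))  # ['a']
```

A value at or above the largest count would cover every image; the alternative is to drop or trim the images that `images_over_limit` reports.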

@kulkarnivishal

kulkarnivishal commented Aug 7, 2018

Thank you. I believe an official fix is due in the next release. Until then I am using legacy/train.py for my training.

@pkulzc

Contributor

pkulzc commented Aug 7, 2018

I sent another PR that should fix the "Expected size[0] in [0, 100], but got 225" issue. A relevant FAQ entry has also been added to help explain this issue.

@akulubala

akulubala commented Aug 18, 2018

I'm hitting the same problem here. I hope this gets fixed soon.
