checkpoint of detection pipeline for ssd_mobilenet_v2_320x320_coco17_tpu-8.config #8875

Open
3 tasks done
DeepleMass opened this issue Jul 15, 2020 · 10 comments

DeepleMass commented Jul 15, 2020

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

ssd_mobilenet_v2_320x320_coco17_tpu-8.config at line 145 and below

2. Describe the bug

I cannot find any checkpoint corresponding to mobilenet_v2.ckpt-1 in the directories of the pre-trained model archive ssd_mobilenet_v2_320x320_coco17_tpu-8.tar.gz.

At line 146 of ssd_mobilenet_v2_320x320_coco17_tpu-8.config, should the type be 'detection' rather than 'classification'?

3. Steps to reproduce

Fine-tune the model on your favorite custom TFRecords using the above-mentioned checkpoints and configs.

4. Expected behavior

I expected the training script to find the checkpoints stored in ssd_mobilenet_v2_320x320_coco17_tpu-8.tar.gz (unzipped to my local storage, of course) and fine-tune the model.

5. Additional context

I only use the official script model_main_tf2.py (no custom coding).

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 18.04
  • Mobile device name if the issue happens on a mobile device: None
  • TensorFlow installed from (source or binary): installed according to the official web site
  • TensorFlow version (use command below): 2.2.0
  • Python version: Python 3.7.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: None (not famous but enough for my fine tuning)
  • GPU model and memory: None

7. pipeline


model {
  ssd {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.9700000286102295
          center: true
          scale: true
          epsilon: 0.0010000000474974513
          train: true
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.9700000286102295
            center: true
            scale: true
            epsilon: 0.0010000000474974513
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011920929
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.599999904632568
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298023224
        max_scale: 0.949999988079071
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.33329999446868896
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.75
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 32

  data_augmentation_options {
    random_horizontal_flip {
    }
  }

  data_augmentation_options {
    ssd_random_crop {
    }

    random_adjust_hue {
    }
  
    random_adjust_saturation {
    }
  
    random_jitter_boxes{
    }
    
    random_patch_gaussian{
    }
    
    random_jpeg_quality{
    }
    
    random_distort_color{
    }
    
    random_pad_image{
    }
    
    ssd_random_crop_fixed_aspect_ratio{
    }
  
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.800000011920929
          total_steps: 50000
          warmup_learning_rate: 0.13333000242710114
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/"
  # num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 10
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "label.pbtxt"
  tf_record_input_reader {
    input_path: "train.tfrecord"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "label.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "test.tfrecord"
  }
}

8. Error message produced

python tf2/models/research/object_detection/model_main_tf2.py --model_dir=tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ --pipeline_config_path=tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/pipeline.config --train_dir=tf2/train/ --alsologtostderr 
2020-07-15 19:50:06.572126: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-15 19:50:09.492668: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2020-07-15 19:50:09.492802: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (orquideaWindt): /proc/driver/nvidia/version does not exist
2020-07-15 19:50:09.493613: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-15 19:50:09.524280: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2699905000 Hz
2020-07-15 19:50:09.525184: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7ff144000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-15 19:50:09.525238: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
W0715 19:50:09.531138 140676925527872 cross_device_ops.py:1175] There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0715 19:50:09.533139 140676925527872 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0715 19:50:09.541121 140676925527872 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0715 19:50:09.541361 140676925527872 config_util.py:552] Maybe overwriting use_bfloat16: False
2020-07-15 19:50:09.598769: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unable to open table file tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tf2/models/research/object_detection/model_main_tf2.py", line 106, in <module>
    tf.compat.v1.app.run()
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tf2/models/research/object_detection/model_main_tf2.py", line 103, in main
    use_tpu=FLAGS.use_tpu)
  File "/home/nicolas/Dokumente/development/doorDetection/tf2/models/research/object_detection/model_lib_v2.py", line 554, in train_loop
    unpad_groundtruth_tensors)
  File "/home/nicolas/Dokumente/development/doorDetection/tf2/models/research/object_detection/model_lib_v2.py", line 335, in load_fine_tune_checkpoint
    if not is_object_based_checkpoint(checkpoint_path):
  File "/home/nicolas/Dokumente/development/doorDetection/tf2/models/research/object_detection/model_lib_v2.py", line 298, in is_object_based_checkpoint
    var_names = [var[0] for var in tf.train.list_variables(checkpoint_path)]
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 98, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 67, in load_checkpoint
    return py_checkpoint_reader.NewCheckpointReader(filename)
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
    error_translator(e)
  File "/home/nicolas/anaconda3/envs/doorSupTf2/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 44, in error_translator
    raise errors_impl.DataLossError(None, None, error_message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

Thanks for any feedback.
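For anyone debugging the DataLossError above: a TF2 checkpoint is stored as a `<prefix>.index` file plus one or more `<prefix>.data-XXXXX-of-XXXXX` shards, and checkpoints are normally addressed by the prefix (here that would be `.../checkpoint/ckpt-0`). The sketch below is a hypothetical pre-flight helper, `find_checkpoint_prefixes`, which is not part of the Object Detection API. It uses only the standard library to list the prefixes in a directory that have both an index file and at least one data shard. Whether a missing file or the path format is what triggers the error in this report is not confirmed in the thread.

```python
import re
from pathlib import Path

# Data shards of a TF2 checkpoint are named "<prefix>.data-00000-of-00001", etc.
_DATA_SHARD = re.compile(r"^(?P<prefix>.+)\.data-\d{5}-of-\d{5}$")


def find_checkpoint_prefixes(checkpoint_dir):
    """Return sorted checkpoint prefixes that have an index file and data shards.

    Hypothetical helper for illustration; it only inspects file names and
    never opens the checkpoint itself.
    """
    directory = Path(checkpoint_dir)
    names = {p.name for p in directory.iterdir() if p.is_file()}
    prefixes = set()
    for name in names:
        match = _DATA_SHARD.match(name)
        # Keep the prefix only if the matching "<prefix>.index" is present too.
        if match and match.group("prefix") + ".index" in names:
            prefixes.add(str(directory / match.group("prefix")))
    return sorted(prefixes)


if __name__ == "__main__":
    # Path taken from the report above; adjust to your local layout.
    ckpt_dir = "tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint"
    if Path(ckpt_dir).is_dir():
        print(find_checkpoint_prefixes(ckpt_dir))
    else:
        print(f"{ckpt_dir} not found")
```

If TensorFlow is installed, `tf.train.latest_checkpoint(checkpoint_dir)` returns the most recent prefix recorded in the directory's `checkpoint` state file, which is an alternative to scanning file names by hand.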

DeepleMass added the models:research and type:bug labels Jul 15, 2020

sainisanjay commented Jul 17, 2020

Hi @DeepleMass,
I installed TF 2.2 following the official web site, but when I open another terminal I can't import it (it says TensorFlow is not installed). When I run docker run -it od, in the terminal tensorflow@c20aa0f0bac0:~/models/research$ I can import TensorFlow successfully, but when I try to run any Python script that uses the TF Object Detection API
test

@sambhusuryamohan

@DeepleMass You can use the mobilenetv2 model available in the link https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_classification_zoo.md
Line 146 'classification' is correct when you use this model.

@sambhusuryamohan

@DeepleMass
I forgot to add: the link for mobilenetv2 was not working before because of a typo. Here is the correct link:
http://download.tensorflow.org/models/object_detection/classification/tf2/20200710/mobilenet_v2.tar.gz

@DeepleMass
Author

@sambhusuryamohan
Hello Sam,
Thanks for the reply. I will give this a try and get back to you with feedback.
Please note that I'm on vacation for the next two weeks.
Best regards
Nicolas Windt
PS: what about @sainisanjay's contribution above?

@sambhusuryamohan

@DeepleMass Sorry I am not able to understand what @sainisanjay is talking about. He is talking about some other issue here. I think he must have edited his post.


ya0002 commented Jul 22, 2020

@DeepleMass You can use ckpt-0, as in tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0, since these are the model's checkpoints.

Refer to this Colab notebook by Roboflow
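As a minimal sketch of that suggestion, the snippet below rewrites the `fine_tune_checkpoint` line of a pipeline config so that it names the `ckpt-0` prefix instead of the directory. `set_fine_tune_checkpoint` is a hypothetical helper written for illustration, not part of the Object Detection API. It treats the config as plain text and assumes the field appears exactly once, with its value in double quotes.

```python
import re

# Matches the fine_tune_checkpoint field only; the trailing colon keeps
# fine_tune_checkpoint_type and fine_tune_checkpoint_version from matching.
_FINE_TUNE_LINE = re.compile(
    r'^([ \t]*fine_tune_checkpoint:[ \t]*)"[^"]*"', re.MULTILINE
)


def set_fine_tune_checkpoint(config_text, checkpoint_prefix):
    """Return config_text with fine_tune_checkpoint set to checkpoint_prefix."""
    new_text, count = _FINE_TUNE_LINE.subn(
        lambda m: f'{m.group(1)}"{checkpoint_prefix}"', config_text
    )
    if count != 1:
        raise ValueError(f"expected one fine_tune_checkpoint line, found {count}")
    return new_text


if __name__ == "__main__":
    # Two lines taken from the config in the original report.
    snippet = (
        '  fine_tune_checkpoint: "tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/"\n'
        '  fine_tune_checkpoint_type: "detection"\n'
    )
    print(
        set_fine_tune_checkpoint(
            snippet, "tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0"
        )
    )
```

Editing the file by hand works just as well; the point is only that the value ends in the prefix `ckpt-0`, with no trailing slash and no file extension.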

@DeepleMass
Author

@myjaa : I'm trying that too this week


DeepleMass commented Aug 13, 2020

I'm now facing the following situation:
I use

My command is:

python /tf/models/research/object_detection/model_main_tf2.py \
--train_dir efficientdet_d0/efficientdet_d0_coco17_tpu-32/train \
--model_dir efficientdet_d0/efficientdet_d0_coco17_tpu-32 \
--pipeline_config_path efficientdet_d0/efficientdet_d0_coco17_tpu-32/pipeline.config \
--alsologtostderr

My config is:

model {
  ssd {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 512
        max_dimension: 512
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: "ssd_efficientnet-b0_bifpn_keras"
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: SWISH
        batch_norm {
          decay: 0.9900000095367432
          scale: true
          epsilon: 0.0010000000474974513
        }
        force_use_bias: true
      }
      bifpn {
        min_level: 3
        max_level: 7
        num_iterations: 3
        num_filters: 64
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 1.0
        x_scale: 1.0
        height_scale: 1.0
        width_scale: 1.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: SWISH
          batch_norm {
            decay: 0.9900000095367432
            scale: true
            epsilon: 0.0010000000474974513
          }
          force_use_bias: true
        }
        depth: 64
        num_layers_before_predictor: 3
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 3
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 1.5
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    add_background_class: false
  }
}
train_config {
  batch_size: 128
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_scale_crop_and_pad_to_square {
      output_size: 512
      scale_min: 0.10000000149011612
      scale_max: 2.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 300000
          warmup_learning_rate: 0.0010000000474974513
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0.*"
  num_steps: 300000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "classification"
  use_bfloat16: true
  fine_tune_checkpoint_version: V2
}
train_input_reader: {
  label_map_path: "labelmap.pbtxt"
  tf_record_input_reader {
    input_path: "/tf/dataset/images/door/trainrecords/train.tfrecord-000*"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "labelmap.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/tf/dataset/images/door/testrecords/test.tfrecord-000*"
  }
}

The error message is

lot of ugly stuff ...

2020-08-13 21:07:25.847213: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unable to open table file efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tf/models/research/object_detection/model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/tf/models/research/object_detection/model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "/tf/models/research/object_detection/model_lib_v2.py", line 564, in train_loop
    unpad_groundtruth_tensors)
  File "/tf/models/research/object_detection/model_lib_v2.py", line 340, in load_fine_tune_checkpoint
    if not is_object_based_checkpoint(checkpoint_path):
  File "/tf/models/research/object_detection/model_lib_v2.py", line 303, in is_object_based_checkpoint
    var_names = [var[0] for var in tf.train.list_variables(checkpoint_path)]
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 112, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 67, in load_checkpoint
    return py_checkpoint_reader.NewCheckpointReader(filename)
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
    error_translator(e)
  File "/root/anaconda3/envs/doorDetTf2/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 44, in error_translator
    raise errors_impl.DataLossError(None, None, error_message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

This may be a different issue.
Thank you for any comments on this.
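One thing that stands out in this second config is that `fine_tune_checkpoint` ends in the wildcard `ckpt-0.*`, while the error names the data shard `ckpt-0.data-00000-of-00001`. The sketch below is a hypothetical sanity check, `check_checkpoint_value`, which is not part of TensorFlow or the Object Detection API. It flags values that do not look like a plain checkpoint prefix. The three rules are assumptions about what a well-formed value looks like; the thread does not confirm that the wildcard is the cause here.

```python
import re

# Suffix of a TF2 checkpoint data shard, e.g. ".data-00000-of-00001".
_SHARD_SUFFIX = re.compile(r"\.data-\d{5}-of-\d{5}$")


def check_checkpoint_value(value):
    """Return a list of warnings for a fine_tune_checkpoint value.

    An empty list means the value looks like a plain prefix such as
    ".../checkpoint/ckpt-0". Only the string is inspected, not the disk.
    """
    warnings = []
    if any(ch in value for ch in "*?["):
        warnings.append("contains a glob character; use a literal prefix")
    if value.endswith("/"):
        warnings.append("looks like a directory; append the checkpoint prefix")
    if value.endswith(".index") or _SHARD_SUFFIX.search(value):
        warnings.append("names a checkpoint file; drop the file suffix")
    return warnings


if __name__ == "__main__":
    # Values taken from the two configs in this thread.
    for value in (
        "efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0.*",
        "tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/",
        "efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0",
    ):
        print(value, "->", check_checkpoint_value(value) or "looks like a prefix")
```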


abhishekbalu commented Feb 13, 2021

@DeepleMass

The issue is the format in which your fine-tuning path is set. Try using the following:

"efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/model.ckpt-*" if you have already trained the model locally, or "efficientdet_d0/efficientdet_d0_coco17_tpu-32/checkpoint/model.ckpt" if you got a fresh model zip file from the TF model zoo.

@VeeranjaneyuluToka

When we are training on a GPU, do we need to set use_bfloat16 to False?

jaeyounkim added the models:research:odapi label and removed the models:research label Jun 25, 2021