
ssd mobilenet v3 quantization-aware training failed #8331

Closed
NobuoTsukamoto opened this issue Mar 26, 2020 · 17 comments

@NobuoTsukamoto

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS (Google Colab)
  • Mobile device (e.g., Pixel 4, Samsung Galaxy 10) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.15, 1.15.2
  • Python version: 3.6.9
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 10.1 / 7.6.3
  • GPU model and memory: Tesla T4 / 15079MiB

Please provide the entire URL of the model you are using?
https://github.com/tensorflow/models/tree/master/research/object_detection

Describe the current behavior
ssdlite_mobilenet_v3 quantization-aware training fails with the following error. After training starts, tf.train.Saver appears to raise the error while restoring the checkpoint.

$ python ./object_detection/model_main.py \
    --alsologtostderr \
    --pipeline_config_path=${DATA_DIR}/ssdlite_mobilenet_v3_large_320x320_pet_quant.config \
    --num_train_steps=500000 \
    --sample_1_of_n_eval_examples=1 \
    --model_dir=${DATA_DIR}/train_ssdlite_mobilenet_v3_large_320x320_pet_quant

... start training ...

W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key FeatureExtractor/MobilenetV3/Conv/conv_quant/max not found in checkpoint
Traceback (most recent call last):
  File "/tensorflow-1.15.0/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/tensorflow-1.15.0/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/tensorflow-1.15.0/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key FeatureExtractor/MobilenetV3/Conv/conv_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]
  (1) Not found: Key FeatureExtractor/MobilenetV3/Conv/conv_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]
	 [[save/RestoreV2/_301]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:
............

I have code that reproduces the error on google colaboratory.

Quantization-aware training of the MobileNet V3 image classification model and the DeepLab model completes successfully. Only the object detection model fails.

Describe the expected behavior
Quantization-aware training completes successfully.

Code to reproduce the issue
https://gist.github.com/NobuoTsukamoto/b2ca173b62e933ceeb1c7f0df42bca5f

Other info / logs
log.txt

@NobuoTsukamoto NobuoTsukamoto added models:research models that come under research directory type:bug Bug in the code labels Mar 26, 2020
@zheyangshi

same

@skavulya

skavulya commented Apr 1, 2020

I ran into the same issue and resolved it by using the ssd_mobilenet_edgetpu_coco checkpoint instead of the ssd_mobilenet_v3_large_coco checkpoint. The edgetpu checkpoint also uses MobileNetV3, but with operations tailored for edge deployment.

@Jove125

Jove125 commented Apr 1, 2020

I have the same issue, and I don't want to use that checkpoint (I need a different input resolution and to train my own model).

@sunzhe09

sunzhe09 commented Apr 2, 2020

@skavulya What about training MobileNetV3 small? I found the edgetpu checkpoint still hits the same problem.

@Jove125

Jove125 commented Apr 2, 2020

I found the cause of the error. If you comment out or remove the following line from the config file, the error at validation (training) disappears:
inplace_batchnorm_update: true

I don't know what else this change will affect. Does anyone know what this setting is for? To speed up training?

P.S.
I have not tried to export / use the model yet.
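
For reference, in the stock ssdlite_mobilenet_v3 sample configs this flag appears to sit inside the model { ssd { ... } } block of the pipeline config; a sketch of the change (the num_classes value is illustrative):

```
model {
  ssd {
    num_classes: 37   # illustrative; set this for your own dataset
    # inplace_batchnorm_update: true   # comment out or delete this line
    # ...
  }
}
```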

@skavulya

skavulya commented Apr 2, 2020

I used both the edgetpu checkpoint and pipeline config file for training. I exported the quantized int8 model to tflite and it looks good. The main difference between the mobilenetv3 small/large and ssd edgetpu is that the edgetpu uses the ssd_mobilenet_edgetpu feature extractor.

The feature extractor also uses mobilenetv3 but has conv_defs=mobilenet_v3.V3_EDGETPU and from_layer=['layer_18/expansion_output', 'layer_23']. They say the conv_def is an EdgeTPU friendly variant of MobilenetV3 that uses fused convolutions instead of depthwise in the early layers.

I think you can extend the edge tpu feature extractor and modify the from_layers to change the size of your mobilenet to a smaller one.

The edgetpu pipeline file also has inplace_batchnorm_update: true, so I am not sure whether that makes a difference.
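
For anyone wanting to try this, switching feature extractors appears to be a one-line change in the pipeline config; a sketch, with the type names taken from the sample configs in the repo:

```
model {
  ssd {
    feature_extractor {
      type: "ssd_mobilenet_edgetpu"   # instead of "ssd_mobilenet_v3_large"
      # ...
    }
    # ...
  }
}
```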

@sunzhe09

sunzhe09 commented Apr 9, 2020

@skavulya @Jove125 Thanks! After setting inplace_batchnorm_update: false, it worked for me.

@yahiya6006

What I did was simply delete the checkpoint files from the folder and add these lines to the config file:

fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: true

This restores the parameters from the provided model.ckpt file of MobileNet V3 and creates a new checkpoint file.

My training started; I trained for about 23k steps and got a loss of about ~0.2445.
But the model produces far too many detections, and the detections are not correct even though the predicted labels are. Any suggestions on how to solve this?
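
For anyone unsure where those lines go: they belong in the train_config block of the pipeline config. A sketch (the batch_size value is illustrative):

```
train_config {
  batch_size: 32   # illustrative
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  # ...
}
```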

@oliver8459

@Jove125 @sunzhe09 After setting inplace_batchnorm_update: false, the error was gone, but the mAP does not increase (it oscillates between 1e-6 and 1e-4); it seems the weights aren't updating. Has anyone else hit this problem?

@ngotra2710

I set inplace_batchnorm_update: false and training runs normally. However, I want a quantized model, and with that flag set to false the model is not quantized. Can someone show how to fix the config (or the code) to produce a quantized ssd_mobilenet_v3?
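
For context: in the TF1 Object Detection API, quantization-aware training is enabled by the graph_rewriter block of the pipeline config, while inplace_batchnorm_update only changes how batch-norm statistics are updated. A sketch of the quantization block (the delay value is illustrative):

```
graph_rewriter {
  quantization {
    delay: 48000        # steps of float training before fake-quant nodes activate
    weight_bits: 8
    activation_bits: 8
  }
}
```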

@HongsongLi0728

Hi @oliver8459, I'm now facing the same issue; have you figured out how to fix it?

@jaeyounkim jaeyounkim added models:research:odapi ODAPI and removed models:research models that come under research directory labels Jun 25, 2021
@het-grubbrr

Hi, any updates on this?
@NobuoTsukamoto were you able to fix this?

@animeesh

Which dataset is the pre-trained SSD-MobileNet V3 trained on?

Object Detection automation moved this from Needs triage (Issues) to Closed Nov 17, 2022

@Petros626

> […] the model just provides too many detections and all the detections are not correct but the label for the detection is correct. Any suggestions on how to solve this issue.

Always check that the label order in the generated TFRecord files matches the labelmap.txt file that is loaded for detection.
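
As a hypothetical example, the ids in the label map must line up with the class ids written into the TFRecords (the 'cat'/'dog' names are placeholders):

```
item {
  id: 1        # ids start at 1 and must match the class ids in the TFRecords
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
```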

@Petros626

> pre-trained SSD-MobileNet V3 is trained on which data set ?

Probably COCO or another large dataset.

@Petros626

In general, to get a fully quantized TFLite model, the model you use for transfer learning must itself have quantized weights and activation layers; otherwise, forget Quantization-Aware Training. You can only do the following:

  1. Implement the desired architecture (there are several repos) and train it from scratch with quantization on your preferred dataset (e.g. COCO, KITTI, Open Images).
  2. Use post-training quantization.

This could be helpful to read: https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/contrib/quantize
