
ssd mobilenet v3 quantization-aware training failed #8331

Closed
NobuoTsukamoto opened this issue Mar 26, 2020 · 17 comments

@NobuoTsukamoto

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS (Google Colab)
  • Mobile device (e.g., Pixel 4, Samsung Galaxy 10) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.15, 1.15.2
  • Python version: 3.6.9
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 10.1 / 7.6.3
  • GPU model and memory: Tesla T4 / 15079MiB

Please provide the entire URL of the model you are using?
https://github.com/tensorflow/models/tree/master/research/object_detection

Describe the current behavior
ssdlite_mobilenet_v3 quantization-aware training fails with the following error. After training starts, tf.train.Saver appears to raise the error while restoring the checkpoint.

$ python ./object_detection/model_main.py \
    --alsologtostderr \
    --pipeline_config_path=${DATA_DIR}/ssdlite_mobilenet_v3_large_320x320_pet_quant.config \
    --num_train_steps=500000 \
    --sample_1_of_n_eval_examples=1 \
    --model_dir=${DATA_DIR}/train_ssdlite_mobilenet_v3_large_320x320_pet_quant

... start training ...

W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key FeatureExtractor/MobilenetV3/Conv/conv_quant/max not found in checkpoint
Traceback (most recent call last):
  File "/tensorflow-1.15.0/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/tensorflow-1.15.0/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/tensorflow-1.15.0/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key FeatureExtractor/MobilenetV3/Conv/conv_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]
  (1) Not found: Key FeatureExtractor/MobilenetV3/Conv/conv_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]
	 [[save/RestoreV2/_301]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:
............

I have code that reproduces the error on google colaboratory.

Quantization-aware training of the MobileNet V3 image classification model and the DeepLab model completes successfully. Only the object detection model fails.

Describe the expected behavior
Quantization-aware training completes successfully.

Code to reproduce the issue
https://gist.github.com/NobuoTsukamoto/b2ca173b62e933ceeb1c7f0df42bca5f

Other info / logs
log.txt

@NobuoTsukamoto NobuoTsukamoto added models:research models that come under research directory type:bug Bug in the code labels Mar 26, 2020
@zheyangshi

same

@skavulya

skavulya commented Apr 1, 2020

I ran into the same issue and resolved it by using the ssd_mobilenet_edgetpu_coco checkpoint instead of the ssd_mobilenet_v3_large_coco checkpoint. The edgetpu checkpoint also uses MobileNetV3, but with operations tailored for edge deployment.

@Jove125

Jove125 commented Apr 1, 2020

I have the same issue, and I don't want to use that checkpoint (I need a different input resolution and to train my own model).

@sunzhe09

sunzhe09 commented Apr 2, 2020

@skavulya What about training MobileNetV3 small? I found the edgetpu checkpoint still hits the same problem.

@Jove125

Jove125 commented Apr 2, 2020

I found the cause of the error. If you comment out or remove the following line from the config file, the error at validation (training) disappears:
inplace_batchnorm_update: true

I don't know what else this change will affect. Does anyone know what this setting is for? To speed up training?

P.S.
I have not tried to export / use the model yet.
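
For reference, in the stock ssdlite_mobilenet_v3 sample configs this flag appears to sit inside the model { ssd { ... } } block of the pipeline config; a sketch of the change (the num_classes value is illustrative):

```
model {
  ssd {
    num_classes: 37   # illustrative; set this for your own dataset
    # inplace_batchnorm_update: true   # comment out or delete this line
    # ...
  }
}
```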

@skavulya

skavulya commented Apr 2, 2020

I used both the edgetpu checkpoint and pipeline config file for training. I exported the quantized int8 model to tflite and it looks good. The main difference between the mobilenetv3 small/large and ssd edgetpu is that the edgetpu uses the ssd_mobilenet_edgetpu feature extractor.

The feature extractor also uses mobilenetv3 but has conv_defs=mobilenet_v3.V3_EDGETPU and from_layer=['layer_18/expansion_output', 'layer_23']. They say the conv_def is an EdgeTPU friendly variant of MobilenetV3 that uses fused convolutions instead of depthwise in the early layers.

I think you can extend the edge tpu feature extractor and modify the from_layers to change the size of your mobilenet to a smaller one.

The edgetpu pipeline file also has inplace_batchnorm_update: true, so I am not sure whether that makes a difference.
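
For anyone wanting to try this, switching feature extractors appears to be a one-line change in the pipeline config; a sketch, with the type names taken from the sample configs in the repo:

```
model {
  ssd {
    feature_extractor {
      type: "ssd_mobilenet_edgetpu"   # instead of "ssd_mobilenet_v3_large"
      # ...
    }
    # ...
  }
}
```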

@sunzhe09

sunzhe09 commented Apr 9, 2020

@skavulya @Jove125 Thanks! After setting inplace_batchnorm_update: false, it worked for me.

@yahiya6006

What I did was simply delete the checkpoint files from the folder and add these lines to the config file:

fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: true

This restores the parameters from the provided model.ckpt file of MobileNet V3 and creates a new checkpoint file.

My training started; I trained for about 23k steps and got a loss of about ~0.2445.
But the model produces far too many detections, and the detections are not correct even though the predicted labels are. Any suggestions on how to solve this?
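
For anyone unsure where those lines go: they belong in the train_config block of the pipeline config. A sketch (the batch_size value is illustrative):

```
train_config {
  batch_size: 32   # illustrative
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  # ...
}
```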

@oliver8459

@Jove125 @sunzhe09 After setting inplace_batchnorm_update: false, the error was gone, but the mAP does not increase (it oscillates between 1e-6 and 1e-4); it seems the weights aren't updating. Has anyone else hit this problem?

@ngotra2710

I set inplace_batchnorm_update: false and training runs normally. However, I want a quantized model, and with that flag set to false the model is not quantized. Can someone show how to fix the config (or the code) to produce a quantized ssd_mobilenet_v3?
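
For context: in the TF1 Object Detection API, quantization-aware training is enabled by the graph_rewriter block of the pipeline config, while inplace_batchnorm_update only changes how batch-norm statistics are updated. A sketch of the quantization block (the delay value is illustrative):

```
graph_rewriter {
  quantization {
    delay: 48000        # steps of float training before fake-quant nodes activate
    weight_bits: 8
    activation_bits: 8
  }
}
```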

@HongsongLi0728

Hi @oliver8459, I'm now facing the same issue; have you figured out how to fix it?

@jaeyounkim jaeyounkim added models:research:odapi ODAPI and removed models:research models that come under research directory labels Jun 25, 2021
@het-grubbrr

Hi, any updates on this?
@NobuoTsukamoto were you able to fix this?

@animeesh

Which dataset is the pre-trained SSD-MobileNet V3 trained on?

Object Detection automation moved this from Needs triage (Issues) to Closed Nov 17, 2022

@Petros626

> […] the model just provides too many detections and all the detections are not correct but the label for the detection is correct. Any suggestions on how to solve this issue.

Always check that the label order in the generated TFRecord files matches the labelmap.txt file that is loaded for detection.
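
As a hypothetical example, the ids in the label map must line up with the class ids written into the TFRecords (the 'cat'/'dog' names are placeholders):

```
item {
  id: 1        # ids start at 1 and must match the class ids in the TFRecords
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
```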

@Petros626

> pre-trained SSD-MobileNet V3 is trained on which data set ?

Probably COCO or another large dataset.

@Petros626

In general, to get a fully quantized TFLite model, the model you use for transfer learning must itself have quantized weights and activation layers; otherwise, forget Quantization-Aware Training. You can only do the following:

  1. Implement the desired architecture (there are several repos) and train it from scratch with quantization on your preferred dataset (e.g. COCO, KITTI, Open Images).
  2. Use post-training quantization.

This could be helpful to read: https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/contrib/quantize
