This repository was archived by the owner on Jun 5, 2024. It is now read-only.

Commit 51a136d (1 parent: 943795e)

Added EfficientDet

File tree

3 files changed: +49 −2 lines

COMMANDS.md (+40)
@@ -14,3 +14,43 @@ $ python obj_detect_tracking.py \
```
This is for processing AVI videos. For MP4 videos, run without `--use_lijun_video_loader`.
Add `--log_time_and_gpu` to get GPU utilization and a time profile.


## 04-2020, added EfficientDet

[EfficientDet (CVPR 2020)](https://github.com/google/automl/tree/master/efficientdet) is reported to be more than 14 mAP better on COCO than the Resnet-50 FPN model we used.

I have made the following changes based on the code from early March:

+ The original code assumes width == height and pads a (1280x720) frame to (1280x1280) at the beginning, which wastes a lot of computation; see [this issue](https://github.com/google/automl/issues/162). This is an easy fix. Note that I make sure the image sizes are multiples of 128 (2^7), with some padding, so a (1280x720) input becomes (1280x768).
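The padding rule described above can be sketched as follows (a minimal illustration, not the repository's actual code; the function name is hypothetical):

```python
def pad_to_multiple(height, width, multiple=128):
    """Round each dimension up to the nearest multiple (here 128 = 2^7),
    so repeated downsampling by 2 up to seven times still yields integer
    feature-map sizes at every FPN level."""
    pad_h = (multiple - height % multiple) % multiple
    pad_w = (multiple - width % multiple) % multiple
    return height + pad_h, width + pad_w

# A 720x1280 frame is padded to 768x1280 instead of 1280x1280.
print(pad_to_multiple(720, 1280))  # → (768, 1280)
```

Compared with square padding, this keeps the wasted area to at most one stride's worth per dimension.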
+ Added multi-level ROI align on the final detection boxes, since we need the FPN box features for deep-SORT tracking. Because one-stage object detection models make box predictions at each feature level, I added a level index variable that keeps track of each box's feature level, so that in the end each box can be efficiently traced back to the feature map it came from to crop its features.
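The level-index bookkeeping can be sketched like this (an illustrative sketch with hypothetical names, not the repository's actual implementation):

```python
import numpy as np

def gather_level_indices(boxes_per_level):
    """Concatenate per-level box predictions while recording which FPN
    level each box came from, so that after NMS the surviving boxes can
    be traced back to the right feature map for ROI align.

    boxes_per_level: list of (N_l, 4) arrays, one per FPN level.
    Returns (all_boxes, levels) with levels[i] = FPN level of box i."""
    boxes = np.concatenate(boxes_per_level, axis=0)
    levels = np.concatenate([
        np.full(len(b), l, dtype=np.int64)
        for l, b in enumerate(boxes_per_level)
    ])
    return boxes, levels

# Two levels with 3 and 2 boxes; levels[keep] after NMS tells us
# which feature map to crop each surviving box's features from.
boxes, levels = gather_level_indices([np.zeros((3, 4)), np.zeros((2, 4))])
print(levels)  # → [0 0 0 1 1]
```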
+ Similar to the MaskRCNN model, I modified EfficientDet so that NMS can be run on only a subset of the COCO classes (currently we only care about person and vehicle), which saves computation.
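One simple way to realize partial-class NMS is to zero out the score columns of unwanted classes so per-class NMS has nothing to process for them. This is only an illustrative sketch under that assumption (the class ids and names below are examples, not the repository's mapping):

```python
import numpy as np

KEEP_IDS = {0, 2, 5, 7}  # e.g. person and some vehicle classes (example ids)

def keep_partial_classes(class_scores, keep_ids):
    """class_scores: (num_boxes, num_classes) array of per-class scores.
    Classes not in keep_ids get score 0, so any downstream per-class NMS
    and thresholding skip them entirely."""
    mask = np.zeros(class_scores.shape[1], dtype=bool)
    mask[list(keep_ids)] = True
    return class_scores * mask  # broadcasts the column mask over boxes

scores = np.random.rand(10, 90)
filtered = keep_partial_classes(scores, KEEP_IDS)
```

Kept columns pass through unchanged, so detection quality on the classes of interest is unaffected.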
+ Separated out the tf.py_func parts, since that part of the graph cannot be saved to a .pb model. (The official EfficientDet code is still actively being developed and this problem seems to have been solved. Will look into this later.)

Example command \[[d0 model from early March](https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/efficientdet-d0.tar.gz)\]:
```
$ python obj_detect_tracking.py \
--model_path efficientdet-d0 \
--efficientdet_modelname efficientdet-d0 --is_efficientdet \
--efficientdet_max_detection_topk 1000 \
--video_dir videos --tracking_dir output/ --video_lst_file videos.lst \
--version 2 --is_coco_model --use_partial_classes --frame_gap 8 \
--get_tracking --tracking_objs Person,Vehicle --min_confidence 0.6 \
--max_size 1280 --short_edge_size 720 \
--use_lijun_video_loader --nms_max_overlap 0.85 --max_iou_distance 0.5 \
--max_cosine_distance 0.5 --nn_budget 5
```
This is for processing AVI videos. For MP4 videos, run without `--use_lijun_video_loader`.
Add `--log_time_and_gpu` to get GPU utilization and a time profile.

Example command with a partial frozen graph \[[d0-TFv1.15](https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/efficientd0_tfv1.15_1280x720.pb)\]:
```
$ python obj_detect_tracking.py \
--model_path efficientd0_tfv1.15_1280x720.pb --is_load_from_pb \
--efficientdet_modelname efficientdet-d0 --is_efficientdet \
--efficientdet_max_detection_topk 1000 \
--video_dir videos --tracking_dir output/ --video_lst_file videos.lst \
--version 2 --is_coco_model --use_partial_classes --frame_gap 8 \
--get_tracking --tracking_objs Person,Vehicle --min_confidence 0.6 \
--max_size 1280 --short_edge_size 720 \
--use_lijun_video_loader --nms_max_overlap 0.85 --max_iou_distance 0.5 \
--max_cosine_distance 0.5 --nn_budget 5
```

README.md (+4 −2)
@@ -35,9 +35,11 @@ We utilize state-of-the-art object detection and tracking algorithm in surveilla
</div>

## Updates
+ [04/2020] Added [EfficientDet (CVPR 2020)](https://github.com/google/automl/tree/master/efficientdet) for inference. It is reported to be more than 14 mAP better than the Resnet-50 FPN model we used. Modified to be more efficient, and tested with Python 2 & 3 and TF 1.15. See example commands [here](COMMANDS.md).

+ [02/2020] We used a Resnet-50 FPN model trained on MS-COCO for [MEVA](http://mevadata.org/) activity detection and got a competitive pAUDC of [0.49](images/inf_actev_0.49audc_02-2020.png) on the [leaderboard](https://actev.nist.gov/sdl), with a total processing speed of 0.64x real-time on a 4-GPU machine. The object detection module's processing speed is about 0.125x real-time. \[[Frozen Model](https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/obj_coco_resnet50_partial_tfv1.14_1280x720_rpn300.pb)\] \[[Example Command](COMMANDS.md)\]

+ [01/2020] We discovered a problem with using OpenCV to extract frames from AVI videos. Some AVI videos have duplicate frames that are not physically present in the files; they are only instructions to repeat previous frames. The problem is that OpenCV skips these frames without warning, according to [this bug report](https://github.com/opencv/opencv/issues/9053) and [this discussion](https://stackoverflow.com/questions/44488636/opencv-reading-frames-from-videocapture-advances-the-video-to-bizarrely-wrong-l/44551037). Therefore with OpenCV you may get fewer frames, which makes the frame indices of the detection results incorrect. Solutions: 1. convert the AVI videos to MP4 format; 2. use MoviePy or [Lijun's video loader](https://github.com/Lijun-Yu/diva_io) (based on PyAV), although they are 10% ~ 30% slower than OpenCV frame extraction. See `obj_detect_tracking.py` for the implementation.

## Dependencies
The code was originally written for Tensorflow v1.10 with Python 2/3, but it works on v1.13.1, too. Note that I didn't change the code for v1.13.1; I just disabled Tensorflow warnings and logging. I have also tested this on tf v1.14.0 (the ResNeXt backbone needs >=1.14 for group convolution support).

obj_detect_tracking.py (+5)
@@ -20,6 +20,11 @@
import logging
logging.getLogger("tensorflow").disabled = True

import matplotlib
# avoid the warning "gdk_cursor_new_for_display:
# assertion 'GDK_IS_DISPLAY (display)' failed" with Python 3
matplotlib.use('Agg')

from tqdm import tqdm

import numpy as np
